Yi-En Tseng, Aug 9th 2023
Feature Engineering: Perform the weight-of-evidence (WOE) transformation for the above variables according to "A Data Scientist’s Toolkit to Encode Categorical Variables to Numeric".
Build a simple decision tree model or a logistic regression model with the above variables.
Build the RF (Random Forest) model and experiment with at least two sampling methods (under-sampling or over-sampling techniques).
Build (1) the GBM (Gradient Boosting Machine) model and (2) the Deep Learning model.
Build (1) the GLM model and (2) the AutoML model.
The evaluation criteria include ROC and the cumulative lift. Make sure you read the H2O documentation for the hyper-parameters to test accordingly. You can also select or drop variables to improve model performance.
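Both criteria can also be computed outside H2O; a minimal sketch with scikit-learn and NumPy on toy scores (the arrays and the `cumulative_lift` helper are illustrative, not part of the assignment):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted default probabilities (illustrative values)
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.3, 0.8, 0.2])

# ROC AUC: probability a random positive outranks a random negative
auc = roc_auc_score(y_true, y_prob)

def cumulative_lift(y_true, y_prob, pct):
    """Default rate in the top pct of scores, relative to the base rate."""
    order = np.argsort(-y_prob)           # highest scores first
    top_n = max(1, int(len(y_true) * pct))
    top_rate = y_true[order][:top_n].mean()
    return top_rate / y_true.mean()

lift_25 = cumulative_lift(y_true, y_prob, 0.25)
```

H2O reports the same two quantities on its model objects; the sketch just makes the definitions concrete.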
| Var | dtype | Description | Category |
|---|---|---|---|
| AP001 | Numeric | YR_AGE | Application |
| AP003 | Numeric | CODE_EDUCATION | Application |
| AP008 | Numeric | FLAG_IP_CITY_NOT_APPL_CITY | Application |
| CR009 | Numeric | AMT_LOAN_TOTAL | Credit Bureau |
| CR015 | Numeric | MONTH_CREDIT_CARD_MOB_MAX | Credit Bureau |
| CR019 | Numeric | SCORE_SINGLE_DEBIT_CARD_LIMIT | Credit Bureau |
| PA022 | Numeric | DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_OR_HIGH_RISK_CALL | Call Detail |
| PA023 | Numeric | DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_CALL | Call Detail |
| PA029 | Numeric | AVG_LEN_COLLECTION_OR_HIGH_RISK_INBOUND_CALLS | Call Detail |
| TD001 | Numeric | TD_CNT_QUERY_LAST_7Day_P2P | Credit Center |
| TD002 | Numeric | TD_CNT_QUERY_LAST_7Day_SMALL_LOAN | Credit Center |
| TD006 | Numeric | TD_CNT_QUERY_LAST_1MON_SMALL_LOAN | Credit Center |
| TD009 | Numeric | TD_CNT_QUERY_LAST_3MON_P2P | Credit Center |
| TD010 | Numeric | TD_CNT_QUERY_LAST_3MON_SMALL_LOAN | Credit Center |
| TD014 | Numeric | TD_CNT_QUERY_LAST_6MON_SMALL_LOAN | Credit Center |
import pandas as pd
#path = '/Users/yientseng/Desktop/Classes/APAN 5420/L3/'
#df = pd.read_csv(path + 'XYZloan_default_selected_vars.csv')
df = pd.read_csv('XYZloan_default_selected_vars.csv')
df.head(5)
| Unnamed: 0.1 | Unnamed: 0 | id | loan_default | AP001 | AP002 | AP003 | AP004 | AP005 | AP006 | ... | CD162 | CD164 | CD166 | CD167 | CD169 | CD170 | CD172 | CD173 | MB005 | MB007 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 31 | 2 | 1 | 12 | 2017/7/6 10:21 | ios | ... | 13.0 | 13.0 | 0.0 | 0.0 | 1449.0 | 1449.0 | 2249.0 | 2249.0 | 7.0 | IPHONE7 |
| 1 | 1 | 2 | 2 | 0 | 27 | 1 | 1 | 12 | 2017/4/6 12:51 | h5 | ... | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | NaN | WEB |
| 2 | 2 | 3 | 3 | 0 | 33 | 1 | 4 | 12 | 2017/7/1 14:11 | h5 | ... | 3.0 | 2.0 | 33.0 | 0.0 | 33.0 | 0.0 | 143.0 | 110.0 | 8.0 | WEB |
| 3 | 3 | 4 | 4 | 0 | 34 | 2 | 4 | 12 | 2017/7/7 10:10 | android | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | OPPO |
| 4 | 4 | 5 | 5 | 0 | 47 | 2 | 1 | 12 | 2017/7/6 14:37 | h5 | ... | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | -99.0 | NaN | WEB |
5 rows × 89 columns
columns_to_keep = ['id','loan_default','AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006','TD009', 'TD010', 'TD014']
df = df[columns_to_keep]
df.shape
df.describe()
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 80000.000000 | 80000.000000 | 80000.000000 | 80000.000000 | 80000.000000 | 8.000000e+04 | 80000.000000 | 80000.000000 | 79619.000000 | 79619.000000 | 79619.000000 | 80000.000000 | 80000.000000 | 80000.000000 | 80000.00000 | 80000.000000 | 80000.000000 |
| mean | 40000.500000 | 0.193600 | 31.706913 | 2.014925 | 3.117200 | 3.518711e+04 | 4.924750 | 6.199038 | 19.298811 | 14.828822 | -42.407356 | 1.986962 | 3.593037 | 1.345700 | 5.40600 | 2.020812 | 2.603662 |
| std | 23094.155105 | 0.395121 | 7.075070 | 1.196806 | 1.306335 | 6.359684e+04 | 1.094305 | 3.359354 | 39.705478 | 37.009374 | 97.006168 | 1.807445 | 2.799570 | 1.413362 | 4.02311 | 1.973988 | 2.505840 |
| min | 1.000000 | 0.000000 | 20.000000 | 1.000000 | 1.000000 | 0.000000e+00 | 2.000000 | 1.000000 | -99.000000 | -99.000000 | -99.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 25% | 20000.750000 | 0.000000 | 27.000000 | 1.000000 | 2.000000 | 4.700000e+03 | 5.000000 | 3.000000 | -1.000000 | -1.000000 | -98.000000 | 1.000000 | 2.000000 | 0.000000 | 3.00000 | 1.000000 | 1.000000 |
| 50% | 40000.500000 | 0.000000 | 30.000000 | 1.000000 | 3.000000 | 1.728500e+04 | 5.000000 | 5.000000 | -1.000000 | -1.000000 | -98.000000 | 2.000000 | 3.000000 | 1.000000 | 4.00000 | 2.000000 | 2.000000 |
| 75% | 60000.250000 | 0.000000 | 35.000000 | 3.000000 | 4.000000 | 4.075000e+04 | 6.000000 | 10.000000 | 41.000000 | 14.000000 | 26.000000 | 3.000000 | 5.000000 | 2.000000 | 7.00000 | 3.000000 | 4.000000 |
| max | 80000.000000 | 1.000000 | 56.000000 | 6.000000 | 5.000000 | 1.420300e+06 | 6.000000 | 12.000000 | 448.000000 | 448.000000 | 2872.000000 | 20.000000 | 24.000000 | 21.000000 | 46.00000 | 35.000000 | 43.000000 |
AP001_type = df.dtypes['AP001']
AP003_type = df.dtypes['AP003']
AP008_type = df.dtypes['AP008']
CR009_type = df.dtypes['CR009']
CR015_type = df.dtypes['CR015']
CR019_type = df.dtypes['CR019']
PA022_type = df.dtypes['PA022']
PA023_type = df.dtypes['PA023']
PA029_type = df.dtypes['PA029']
TD001_type = df.dtypes['TD001']
TD005_type = df.dtypes['TD005']
TD006_type = df.dtypes['TD006']
TD009_type = df.dtypes['TD009']
TD010_type = df.dtypes['TD010']
TD014_type = df.dtypes['TD014']
print(AP001_type, AP003_type, AP008_type,CR009_type, CR015_type, CR019_type,PA022_type, PA023_type, PA029_type)
print(TD001_type, TD005_type, TD006_type, TD009_type, TD010_type, TD014_type)
int64 int64 int64 int64 int64 int64 float64 float64 float64 int64 int64 int64 int64 int64 int64
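The fifteen separate lookups above can be collapsed into a single call; a small sketch on a toy frame (the `toy` frame is illustrative, standing in for `df`):

```python
import pandas as pd

# Toy frame standing in for two of the selected loan columns
toy = pd.DataFrame({'AP001': [31, 27], 'PA022': [19.0, None]})

# .dtypes returns every column's dtype at once
dtypes = toy.dtypes
```

On the real frame, `df[variables].dtypes` would print the same information as the fifteen assignments above.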
#Examine missing data: of the selected variables, only 'PA022', 'PA023', 'PA029' contain NaN
#df[['PA022', 'PA023', 'PA029']].isna().any()
#PA022, PA023, PA029 all return True
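The commented per-column checks above can be run in one shot; a sketch on a toy frame with one missing value (toy data, not the loan file):

```python
import numpy as np
import pandas as pd

# Toy frame: PA022 has one NaN, AP001 has none
toy = pd.DataFrame({'AP001': [31, 27, 33], 'PA022': [19.0, np.nan, 41.0]})

missing_any = toy.isna().any()      # per-column flag: any NaN present?
missing_counts = toy.isna().sum()   # per-column NaN count
```

`df[['PA022', 'PA023', 'PA029']].isna().sum()` would additionally show how many rows are affected (381 each, per the describe() counts above).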
variables = ['AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006','TD009', 'TD010', 'TD014']
target_variable = 'loan_default'
for var in variables:
    # Calculating the average loan_default for different values of X
    avg_loan_default_by_X = df.groupby(var)[target_variable].mean()
    print(f'Average {target_variable} by {var}:')
    print(avg_loan_default_by_X)
    print('\n')
Average loan_default by AP001:
AP001
20 0.221239
21 0.264848
22 0.208487
23 0.204638
24 0.200047
25 0.204809
26 0.201072
27 0.211450
28 0.198194
29 0.190930
30 0.197074
31 0.194540
32 0.189284
33 0.188460
34 0.187583
35 0.178831
36 0.178732
37 0.173545
38 0.182713
39 0.181269
40 0.194564
41 0.180445
42 0.183333
43 0.179156
44 0.181818
45 0.165138
46 0.178610
47 0.192771
48 0.169839
49 0.166124
50 0.185819
51 0.173913
52 0.193309
53 0.135036
54 0.155797
55 0.206061
56 0.160000
Name: loan_default, dtype: float64
Average loan_default by AP003:
AP003
1 0.221034
3 0.173948
4 0.125853
5 0.060345
6 0.000000
Name: loan_default, dtype: float64
Average loan_default by AP008:
AP008
1 0.168286
2 0.179188
3 0.195604
4 0.209325
5 0.209394
Name: loan_default, dtype: float64
Average loan_default by CR009:
CR009
0 0.171687
50 0.000000
99 0.000000
100 0.000000
150 0.000000
...
1353000 1.000000
1368505 1.000000
1381000 0.000000
1381800 0.000000
1420300 0.000000
Name: loan_default, Length: 25883, dtype: float64
Average loan_default by CR015:
CR015
2 0.188389
3 0.247678
4 0.218583
5 0.207024
6 0.154864
Name: loan_default, dtype: float64
Average loan_default by CR019:
CR019
1 0.221311
2 0.220759
3 0.213964
4 0.212296
5 0.196685
6 0.179220
7 0.195236
8 0.182716
9 0.163152
10 0.177083
11 0.165039
12 0.163088
Name: loan_default, dtype: float64
Average loan_default by PA022:
PA022
-99.0 0.149935
-1.0 0.171054
0.0 0.193103
1.0 0.296117
2.0 0.227907
...
437.0 0.000000
440.0 0.000000
441.0 0.000000
445.0 1.000000
448.0 0.000000
Name: loan_default, Length: 172, dtype: float64
Average loan_default by PA023:
PA023
-99.0 0.149935
-1.0 0.175095
0.0 0.162393
1.0 0.273256
2.0 0.257310
...
434.0 1.000000
440.0 0.000000
441.0 0.000000
445.0 1.000000
448.0 0.000000
Name: loan_default, Length: 167, dtype: float64
Average loan_default by PA029:
PA029
-99.00 0.149935
-98.00 0.173775
0.00 0.288136
1.00 0.136364
1.50 0.000000
...
1757.00 0.000000
1767.75 0.000000
1919.00 0.000000
2014.00 1.000000
2872.00 0.000000
Name: loan_default, Length: 4120, dtype: float64
Average loan_default by TD001:
TD001
0 0.156904
1 0.163815
2 0.197216
3 0.213688
4 0.236021
5 0.259870
6 0.277253
7 0.278652
8 0.328228
9 0.302419
10 0.259259
11 0.369048
12 0.288889
13 0.400000
14 0.466667
15 0.555556
16 0.166667
17 0.000000
18 0.500000
19 0.750000
20 1.000000
Name: loan_default, dtype: float64
Average loan_default by TD005:
TD005
0 0.132324
1 0.126238
2 0.163685
3 0.188810
4 0.201861
5 0.227266
6 0.244974
7 0.268191
8 0.265170
9 0.299129
10 0.316881
11 0.290634
12 0.332613
13 0.322884
14 0.380734
15 0.387324
16 0.371134
17 0.409091
18 0.257143
19 0.423077
20 0.277778
21 0.333333
22 0.400000
23 0.375000
24 0.400000
Name: loan_default, dtype: float64
Average loan_default by TD006:
TD006
0 0.168552
1 0.176399
2 0.207509
3 0.242584
4 0.269746
5 0.295133
6 0.325503
7 0.307167
8 0.335526
9 0.462264
10 0.327586
11 0.394737
12 0.388889
13 0.307692
14 0.285714
15 0.333333
16 0.500000
17 0.600000
18 0.333333
20 0.000000
21 1.000000
Name: loan_default, dtype: float64
Average loan_default by TD009:
TD009
0 0.113156
1 0.115699
2 0.139940
3 0.158747
4 0.177003
5 0.195302
6 0.209825
7 0.222468
8 0.239288
9 0.254988
10 0.270819
11 0.274088
12 0.291883
13 0.268065
14 0.310415
15 0.335725
16 0.276986
17 0.332432
18 0.312253
19 0.324742
20 0.391892
21 0.407407
22 0.473684
23 0.264706
24 0.486486
25 0.391304
26 0.466667
27 0.125000
28 0.526316
29 0.375000
30 0.444444
31 0.200000
32 0.666667
33 0.333333
34 0.750000
36 0.000000
38 0.000000
39 0.000000
46 1.000000
Name: loan_default, dtype: float64
Average loan_default by TD010:
TD010
0 0.152064
1 0.163854
2 0.191281
3 0.223402
4 0.248087
5 0.275062
6 0.280515
7 0.298647
8 0.296015
9 0.353741
10 0.350000
11 0.370370
12 0.310811
13 0.347826
14 0.433333
15 0.481481
16 0.555556
17 0.647059
18 0.384615
19 0.000000
20 0.500000
21 0.666667
22 0.428571
23 0.000000
24 0.800000
25 0.500000
26 0.000000
28 0.500000
30 1.000000
35 1.000000
Name: loan_default, dtype: float64
Average loan_default by TD014:
TD014
0 0.142579
1 0.155745
2 0.179754
3 0.205032
4 0.237809
5 0.252252
6 0.266280
7 0.294633
8 0.291915
9 0.306410
10 0.311155
11 0.321101
12 0.331839
13 0.297297
14 0.340909
15 0.428571
16 0.446154
17 0.416667
18 0.233333
19 0.363636
20 0.350000
21 0.312500
22 0.636364
23 0.600000
24 0.571429
25 0.666667
26 0.333333
27 1.000000
28 0.500000
30 0.000000
31 0.000000
32 0.000000
36 1.000000
43 1.000000
Name: loan_default, dtype: float64
# Plot for each column
import matplotlib.pyplot as plt
%matplotlib inline

def plot_histogram(data_frame, column_name):
    # Check if the column_name exists in the DataFrame
    if column_name not in data_frame.columns:
        raise ValueError(f"Column '{column_name}' does not exist in the DataFrame.")
    # Plot the histogram
    data_frame[column_name].hist()
plot_histogram(df, 'AP003')
# Plot for all columns
def plot_histograms_for_all_columns(data_frame):
    for column_name in data_frame.columns:
        data_frame[column_name].hist()
        plt.title(f'Histogram of {column_name}')
        plt.xlabel(column_name)
        plt.ylabel('Frequency')
        plt.show()
plot_histograms_for_all_columns(df)
import numpy as np
from sklearn.model_selection import train_test_split
# Variables for WOE transformation
variables = ['AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006', 'TD009', 'TD010', 'TD014']
target_variable = 'loan_default'
# Splitting data into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
#Define function of WOE for train data
#Note: 'Good' here counts loan_default == 1 (defaults) and 'Bad' the non-defaults,
#so the WOE signs are flipped relative to the usual good/bad scorecard convention;
#the transformation is otherwise equivalent.
def WOE(var):
    train_df[var] = train_df[var].fillna('NoData')
    k = train_df[[var, 'loan_default']].groupby(var)['loan_default'].agg(['count', 'sum']).reset_index()
    k.columns = [var, 'Count', 'Good']
    k['Bad'] = k['Count'] - k['Good']
    k['Good %'] = (k['Good'] / k['Good'].sum() * 100).round(2)
    k['Bad %'] = (k['Bad'] / k['Bad'].sum() * 100).round(2)
    k[var + '_WOE'] = np.log(k['Good %'] / k['Bad %']).round(2)
    k = k.sort_values(by=var + '_WOE')
    return k
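A WOE table like the one `WOE()` returns is often summarized by an information value (IV): the sum of (Good% − Bad%) × WOE over all categories. A sketch on toy percentages (the numbers are illustrative, not from the loan data):

```python
import numpy as np
import pandas as pd

# Toy category-level distribution percentages (illustrative)
k = pd.DataFrame({'Good %': [10.0, 30.0, 60.0],
                  'Bad %':  [20.0, 30.0, 50.0]})
k['WOE'] = np.log(k['Good %'] / k['Bad %'])

# IV: sum of (Good% - Bad%) * WOE, with percentages taken as fractions
iv = (((k['Good %'] - k['Bad %']) / 100) * k['WOE']).sum()
```

Often-quoted rules of thumb: an IV below roughly 0.02 suggests a near-useless predictor, and above roughly 0.3 a strong one; this can guide the select-or-drop step mentioned in the instructions.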
k = WOE('AP001')
k
| AP001 | Count | Good | Bad | Good % | Bad % | AP001_WOE | |
|---|---|---|---|---|---|---|---|
| 33 | 53 | 224 | 31 | 193 | 0.25 | 0.37 | -0.39 |
| 31 | 51 | 294 | 47 | 247 | 0.38 | 0.48 | -0.23 |
| 34 | 54 | 220 | 37 | 183 | 0.30 | 0.35 | -0.15 |
| 28 | 48 | 530 | 90 | 440 | 0.73 | 0.85 | -0.15 |
| 25 | 45 | 788 | 135 | 653 | 1.09 | 1.26 | -0.14 |
| 17 | 37 | 1530 | 264 | 1266 | 2.14 | 2.45 | -0.14 |
| 16 | 36 | 1941 | 338 | 1603 | 2.74 | 3.10 | -0.12 |
| 29 | 49 | 494 | 87 | 407 | 0.71 | 0.79 | -0.11 |
| 24 | 44 | 790 | 140 | 650 | 1.13 | 1.26 | -0.11 |
| 19 | 39 | 1325 | 234 | 1091 | 1.90 | 2.11 | -0.10 |
| 26 | 46 | 749 | 133 | 616 | 1.08 | 1.19 | -0.10 |
| 15 | 35 | 2321 | 417 | 1904 | 3.38 | 3.69 | -0.09 |
| 21 | 41 | 1022 | 184 | 838 | 1.49 | 1.62 | -0.08 |
| 30 | 50 | 321 | 58 | 263 | 0.47 | 0.51 | -0.08 |
| 23 | 43 | 916 | 167 | 749 | 1.35 | 1.45 | -0.07 |
| 18 | 38 | 1457 | 270 | 1187 | 2.19 | 2.30 | -0.05 |
| 13 | 33 | 2764 | 516 | 2248 | 4.18 | 4.35 | -0.04 |
| 27 | 47 | 658 | 123 | 535 | 1.00 | 1.04 | -0.04 |
| 14 | 34 | 2393 | 449 | 1944 | 3.64 | 3.76 | -0.03 |
| 9 | 29 | 4056 | 761 | 3295 | 6.17 | 6.38 | -0.03 |
| 22 | 42 | 944 | 179 | 765 | 1.45 | 1.48 | -0.02 |
| 12 | 32 | 2947 | 561 | 2386 | 4.55 | 4.62 | -0.02 |
| 36 | 56 | 19 | 4 | 15 | 0.03 | 0.03 | 0.00 |
| 11 | 31 | 3718 | 724 | 2994 | 5.87 | 5.80 | 0.01 |
| 20 | 40 | 1110 | 217 | 893 | 1.76 | 1.73 | 0.02 |
| 10 | 30 | 4358 | 870 | 3488 | 7.05 | 6.75 | 0.04 |
| 8 | 28 | 4704 | 936 | 3768 | 7.59 | 7.29 | 0.04 |
| 6 | 26 | 4378 | 875 | 3503 | 7.09 | 6.78 | 0.04 |
| 5 | 25 | 3938 | 784 | 3154 | 6.35 | 6.11 | 0.04 |
| 4 | 24 | 3426 | 681 | 2745 | 5.52 | 5.31 | 0.04 |
| 3 | 23 | 2330 | 465 | 1865 | 3.77 | 3.61 | 0.04 |
| 35 | 55 | 131 | 26 | 105 | 0.21 | 0.20 | 0.05 |
| 32 | 52 | 219 | 44 | 175 | 0.36 | 0.34 | 0.06 |
| 7 | 27 | 5081 | 1060 | 4021 | 8.59 | 7.78 | 0.10 |
| 2 | 22 | 1310 | 278 | 1032 | 2.25 | 2.00 | 0.12 |
| 0 | 20 | 86 | 18 | 68 | 0.15 | 0.13 | 0.14 |
| 1 | 21 | 508 | 135 | 373 | 1.09 | 0.72 | 0.41 |
#Append the WOE value of feature back to the original train data
#train_df_AP001_WOE = pd.merge(train_df[['loan_default','AP001']],k[['AP001','AP001_WOE']],
# left_on='AP001',
# right_on='AP001',how='left')
#train_df_AP001_WOE.head(10)
train_df_WOE_AP001 = pd.merge(train_df, k[['AP001', 'AP001_WOE']],
left_on='AP001',
right_on='AP001', how='left')
train_df_WOE_AP001.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP001_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | -98.0 | 5 | 8 | 3 | 14 | 5 | 5 | -0.03 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | 17.5 | 2 | 2 | 0 | 2 | 1 | 1 | -0.04 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | -98.0 | 2 | 3 | 1 | 6 | 2 | 2 | 0.01 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | -98.0 | 5 | 9 | 3 | 9 | 3 | 3 | -0.03 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | -98.0 | 2 | 2 | 0 | 2 | 0 | 0 | -0.09 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | -98.0 | 5 | 11 | 3 | 11 | 4 | 4 | 0.04 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | -98.0 | 3 | 4 | 1 | 6 | 3 | 3 | -0.09 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | -98.0 | 4 | 4 | 1 | 6 | 1 | 1 | 0.04 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | 96.0 | 4 | 9 | 1 | 10 | 1 | 2 | 0.04 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | 52.0 | 2 | 3 | 3 | 5 | 3 | 3 | -0.14 |
#Append the WOE table to the test data
test_df_WOE_AP001 = pd.merge(test_df, k[['AP001', 'AP001_WOE']],
left_on='AP001',
right_on='AP001', how='left')
test_df_WOE_AP001.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP001_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | -98.000000 | 2 | 2 | 1 | 2 | 1 | 1 | 0.04 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | -98.000000 | 2 | 4 | 1 | 7 | 1 | 2 | -0.04 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | 7.000000 | 1 | 3 | 1 | 4 | 1 | 1 | -0.03 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | 120.285714 | 1 | 1 | 3 | 1 | 3 | 4 | 0.04 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | 180.000000 | 4 | 7 | 2 | 15 | 5 | 6 | 0.10 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | -98.000000 | 5 | 7 | 0 | 8 | 3 | 4 | -0.04 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | 139.000000 | 9 | 14 | 6 | 25 | 8 | 11 | 0.04 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | 17.000000 | 2 | 3 | 0 | 3 | 1 | 2 | -0.12 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | -98.000000 | 2 | 5 | 1 | 8 | 2 | 6 | 0.41 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | 164.000000 | 3 | 3 | 1 | 6 | 2 | 3 | 0.04 |
#Append the WOE table to the test data
#test_df_AP001_WOE = pd.merge(test_df[['loan_default','AP001']],k[['AP001','AP001_WOE']],
# left_on='AP001',
# right_on='AP001',how='left')
#test_df_AP001_WOE.head(10)
nan_check = test_df_WOE_AP001['AP001_WOE'].isna()
nan_values = test_df_WOE_AP001['AP001_WOE'][nan_check]
nan_values
Series([], Name: AP001_WOE, dtype: float64)
nan_check = train_df_WOE_AP001['AP001_WOE'].isna()
nan_values = train_df_WOE_AP001['AP001_WOE'][nan_check]
nan_values
Series([], Name: AP001_WOE, dtype: float64)
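Both NaN checks come back empty here because every AP001 value in the test set also appears in the train set. If a test-only category ever appeared, the left merge would leave NaN in the WOE column; a common guard, sketched on toy frames (names and values are illustrative):

```python
import pandas as pd

# Toy WOE lookup built on train: category 'C' was never seen in train
woe_table = pd.DataFrame({'AP001': ['A', 'B'], 'AP001_WOE': [0.2, -0.1]})
test = pd.DataFrame({'AP001': ['A', 'C']})

merged = pd.merge(test, woe_table, on='AP001', how='left')
# Unseen categories get WOE 0: "no evidence either way"
merged['AP001_WOE'] = merged['AP001_WOE'].fillna(0.0)
```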
k = WOE('AP003')
k
#Need to bin this variable
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
| AP003 | Count | Good | Bad | Good % | Bad % | AP003_WOE | |
|---|---|---|---|---|---|---|---|
| 4 | 6 | 12 | 0 | 12 | 0.00 | 0.02 | -inf |
| 3 | 5 | 187 | 11 | 176 | 0.09 | 0.34 | -1.33 |
| 2 | 4 | 8672 | 1107 | 7565 | 8.97 | 14.64 | -0.49 |
| 1 | 3 | 19072 | 3301 | 15771 | 26.75 | 30.53 | -0.13 |
| 0 | 1 | 36057 | 7919 | 28138 | 64.18 | 54.47 | 0.16 |
#train_df['AP003_bin'] = pd.qcut(train_df['AP003'],5,duplicates='drop').values.add_categories("NoData")
#train_df['AP003_bin'] = train_df['AP003_bin'].fillna("NoData").astype(str)
#train_df['AP003_bin'].value_counts(dropna=False)
#pd.cut: Given the values 0, 1, 3, 4, and 5, here's how they would be categorized based on the default behavior:
#0 will belong to the bin interval [0, 1.2)
#1 will belong to the bin interval [0, 1.2)
#3 will belong to the bin interval [2.4, 3.6)
#4 will belong to the bin interval [3.6, 4.8)
#5 will belong to the bin interval [4.8, 6)
#train_df['AP003_bin'] = pd.cut(train_df['AP003'], bins=5, duplicates='drop', labels=['Category 1', 'Category 2', 'Category 3', 'Category 4', 'Category 5'])
#train_df['AP003_bin'] = train_df['AP003_bin'].astype(str).fillna("NoData")
#train_df['AP003_bin'].value_counts(dropna=False)
#Still has -inf value
#Bin the train data
train_df['AP003_bin'] = pd.qcut(train_df['AP003'],5,duplicates='drop').values.add_categories("NoData")
train_df['AP003_bin'] = train_df['AP003_bin'].fillna("NoData").astype(str)
train_df['AP003_bin'].value_counts(dropna=False)
(0.999, 3.0]    55129
(3.0, 6.0]       8871
Name: AP003_bin, dtype: int64
k = WOE('AP003_bin')
k
| AP003_bin | Count | Good | Bad | Good % | Bad % | AP003_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 1 | (3.0, 6.0] | 8871 | 1118 | 7753 | 9.06 | 15.01 | -0.50 |
| 0 | (0.999, 3.0] | 55129 | 11220 | 43909 | 90.94 | 84.99 | 0.07 |
train_df
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3822 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | -98.0 | 5 | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] |
| 35562 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | 17.5 | 2 | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] |
| 4883 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | -98.0 | 2 | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] |
| 71170 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | -98.0 | 5 | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] |
| 25665 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | -98.0 | 2 | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6265 | 6266 | 0 | 25 | 3 | 3 | 12000 | 5 | 3 | -1.0 | -1.0 | -98.0 | 4 | 4 | 1 | 5 | 1 | 2 | (0.999, 3.0] |
| 54886 | 54887 | 0 | 31 | 3 | 4 | 60300 | 6 | 5 | 69.0 | -1.0 | 39.0 | 2 | 4 | 1 | 5 | 1 | 1 | (0.999, 3.0] |
| 76820 | 76821 | 0 | 28 | 3 | 2 | 45167 | 5 | 3 | -1.0 | -1.0 | -98.0 | 2 | 13 | 3 | 14 | 3 | 3 | (0.999, 3.0] |
| 860 | 861 | 1 | 28 | 1 | 5 | 59111 | 6 | 11 | -1.0 | -1.0 | -98.0 | 1 | 2 | 2 | 8 | 2 | 2 | (0.999, 3.0] |
| 15795 | 15796 | 0 | 27 | 1 | 4 | 2878 | 5 | 2 | -1.0 | -1.0 | -98.0 | 1 | 1 | 1 | 3 | 1 | 1 | (0.999, 3.0] |
64000 rows × 18 columns
train_df_WOE_AP003 = pd.merge(train_df, k[['AP003_bin', 'AP003_bin_WOE']],
left_on='AP003_bin',
right_on='AP003_bin', how='left')
train_df_WOE_AP003.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | AP003_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | -98.0 | 5 | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] | -0.50 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | 17.5 | 2 | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] | 0.07 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | -98.0 | 2 | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] | 0.07 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | -98.0 | 5 | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] | 0.07 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | -98.0 | 2 | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] | -0.50 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | -98.0 | 5 | 11 | 3 | 11 | 4 | 4 | (0.999, 3.0] | 0.07 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | -98.0 | 3 | 4 | 1 | 6 | 3 | 3 | (0.999, 3.0] | 0.07 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | -98.0 | 4 | 4 | 1 | 6 | 1 | 1 | (0.999, 3.0] | 0.07 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | 96.0 | 4 | 9 | 1 | 10 | 1 | 2 | (0.999, 3.0] | 0.07 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | 52.0 | 2 | 3 | 3 | 5 | 3 | 3 | (0.999, 3.0] | 0.07 |
#train_df_WOE_AP003_usedtomerge = train_df_WOE_AP003.drop(columns=train_df_WOE_AP003.columns.difference(['AP003', 'AP003_bin']))
#train_df_WOE_AP003_usedtomerge
#Merge the WOE value of each category with the train data
#train_df_AP003_WOE = pd.merge(train_df[['loan_default','AP003''AP003_bin']],k[['AP003_bin','AP003_bin_WOE']],
# left_on='AP003_bin',
# right_on='AP003_bin',how='left')
#train_df_AP003_WOE.head(10)
#train_df_WOE = pd.merge(train_df_WOE, train_df_usedtomerge[['AP003', 'AP003_bin']],
# left_on='AP003',
# right_on='AP003', how='left')
#train_df_WOE.head(10)
nan_check = train_df_WOE_AP003['AP003_bin_WOE'].isna()
nan_values = train_df_WOE_AP003['AP003_bin_WOE'][nan_check]
nan_values
Series([], Name: AP003_bin_WOE, dtype: float64)
#Append the WOE value of each category back to the original train data
#train_df['AP003_WOE']=train_df_WOE_AP003['AP003_bin_WOE']
# Define the desired bin labels
bin_labels = ["(0.999, 3.0]", "(3.0, 6.0]"]
# Bin the test data with the specified labels
test_df['AP003_bin_labels'] = pd.qcut(test_df['AP003'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['AP003_bin'] = pd.qcut(test_df['AP003'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['AP003_bin'] = test_df['AP003_bin'].fillna("NoData")
# Print the value counts
test_df['AP003_bin'].value_counts(dropna=False)
(0.999, 3.0]    13779
(3.0, 6.0]       2221
Name: AP003_bin, dtype: int64
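Re-running `pd.qcut` on the test set happens to reproduce the train bins here, but the two calls are not guaranteed to agree when the distributions differ. A safer pattern is to capture the train edges with `retbins=True` and reuse them via `pd.cut`; a sketch on toy values (illustrative data, not AP003):

```python
import pandas as pd

train_vals = pd.Series([1, 1, 1, 3, 3, 3, 4, 5, 6])
test_vals = pd.Series([2, 4, 6])

# Capture the train-set bin edges once...
_, edges = pd.qcut(train_vals, 5, duplicates='drop', retbins=True)

# ...then apply exactly those edges to the test set
test_bins = pd.cut(test_vals, bins=edges, include_lowest=True)
```

This guarantees the test bins carry the same interval labels as the train bins, so the later WOE merge on the bin column cannot silently mismatch.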
test_df
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin_labels | AP003_bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47044 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | -98.000000 | 2 | 2 | 1 | 2 | 1 | 1 | 0 | (0.999, 3.0] |
| 44295 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | -98.000000 | 2 | 4 | 1 | 7 | 1 | 2 | 0 | (0.999, 3.0] |
| 74783 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | 7.000000 | 1 | 3 | 1 | 4 | 1 | 1 | 1 | (3.0, 6.0] |
| 70975 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | 120.285714 | 1 | 1 | 3 | 1 | 3 | 4 | 0 | (0.999, 3.0] |
| 46645 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | 180.000000 | 4 | 7 | 2 | 15 | 5 | 6 | 0 | (0.999, 3.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67666 | 67667 | 0 | 41 | 1 | 5 | 46967 | 6 | 11 | 56.0 | 56.0 | 0.000000 | 2 | 3 | 2 | 4 | 4 | 4 | 0 | (0.999, 3.0] |
| 51146 | 51147 | 0 | 39 | 1 | 2 | 25796 | 6 | 2 | 91.0 | 91.0 | 59.500000 | 4 | 11 | 3 | 14 | 4 | 5 | 0 | (0.999, 3.0] |
| 42494 | 42495 | 1 | 31 | 1 | 2 | 0 | 5 | 3 | -1.0 | -1.0 | -98.000000 | 3 | 3 | 1 | 3 | 1 | 2 | 0 | (0.999, 3.0] |
| 52517 | 52518 | 0 | 34 | 1 | 1 | 3600 | 3 | 2 | -1.0 | -1.0 | -98.000000 | 3 | 3 | 1 | 3 | 1 | 2 | 0 | (0.999, 3.0] |
| 7754 | 7755 | 0 | 43 | 3 | 2 | 52000 | 6 | 10 | -1.0 | -1.0 | -98.000000 | 2 | 5 | 1 | 10 | 3 | 5 | 0 | (0.999, 3.0] |
16000 rows × 19 columns
#Append the WOE table to the test data
#test_df_WOE_AP003 = pd.merge(test_df[['id','loan_default','AP003','AP003_bin']],k[['AP003_bin','AP003_bin_WOE']],
# left_on='AP003_bin',
# right_on='AP003_bin',how='left')
#test_df_AP003_WOE.head(10)
#Merge using the same pattern as the other binned variables
test_df_WOE_AP003 = pd.merge(test_df, k[['AP003_bin', 'AP003_bin_WOE']],
left_on='AP003_bin',
right_on='AP003_bin', how='left')
test_df_WOE_AP003.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin_labels | AP003_bin | AP003_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | -98.000000 | 2 | 2 | 1 | 2 | 1 | 1 | 0 | (0.999, 3.0] | 0.07 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | -98.000000 | 2 | 4 | 1 | 7 | 1 | 2 | 0 | (0.999, 3.0] | 0.07 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | 7.000000 | 1 | 3 | 1 | 4 | 1 | 1 | 1 | (3.0, 6.0] | -0.50 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | 120.285714 | 1 | 1 | 3 | 1 | 3 | 4 | 0 | (0.999, 3.0] | 0.07 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | 180.000000 | 4 | 7 | 2 | 15 | 5 | 6 | 0 | (0.999, 3.0] | 0.07 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | -98.000000 | 5 | 7 | 0 | 8 | 3 | 4 | 1 | (3.0, 6.0] | -0.50 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | 139.000000 | 9 | 14 | 6 | 25 | 8 | 11 | 0 | (0.999, 3.0] | 0.07 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | 17.000000 | 2 | 3 | 0 | 3 | 1 | 2 | 0 | (0.999, 3.0] | 0.07 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | -98.000000 | 2 | 5 | 1 | 8 | 2 | 6 | 0 | (0.999, 3.0] | 0.07 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | 164.000000 | 3 | 3 | 1 | 6 | 2 | 3 | 0 | (0.999, 3.0] | 0.07 |
nan_check = test_df_WOE_AP003['AP003_bin_WOE'].isna()
nan_values = test_df_WOE_AP003['AP003_bin_WOE'][nan_check]
nan_values
Series([], Name: AP003_bin_WOE, dtype: float64)
k = WOE('AP008')
k
| AP008 | Count | Good | Bad | Good % | Bad % | AP008_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 6788 | 1107 | 5681 | 8.97 | 11.00 | -0.20 |
| 1 | 2 | 17470 | 3119 | 14351 | 25.28 | 27.78 | -0.09 |
| 2 | 3 | 14818 | 2902 | 11916 | 23.52 | 23.07 | 0.02 |
| 3 | 4 | 11381 | 2356 | 9025 | 19.10 | 17.47 | 0.09 |
| 4 | 5 | 13543 | 2854 | 10689 | 23.13 | 20.69 | 0.11 |
#Append the WOE value of feature back to the original train data
train_df_WOE_AP008 = pd.merge(train_df, k[['AP008', 'AP008_WOE']],
left_on='AP008',
right_on='AP008', how='left')
train_df_WOE_AP008.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | AP008_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | -98.0 | 5 | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] | -0.09 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | 17.5 | 2 | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] | -0.09 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | -98.0 | 2 | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] | 0.11 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | -98.0 | 5 | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] | 0.09 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | -98.0 | 2 | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] | 0.02 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | -98.0 | 5 | 11 | 3 | 11 | 4 | 4 | (0.999, 3.0] | -0.09 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | -98.0 | 3 | 4 | 1 | 6 | 3 | 3 | (0.999, 3.0] | 0.11 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | -98.0 | 4 | 4 | 1 | 6 | 1 | 1 | (0.999, 3.0] | 0.11 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | 96.0 | 4 | 9 | 1 | 10 | 1 | 2 | (0.999, 3.0] | 0.11 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | 52.0 | 2 | 3 | 3 | 5 | 3 | 3 | (0.999, 3.0] | 0.02 |
#Append the WOE table to the test data
test_df_WOE_AP008 = pd.merge(test_df, k[['AP008', 'AP008_WOE']],
left_on='AP008',
right_on='AP008', how='left')
test_df_WOE_AP008.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin_labels | AP003_bin | AP008_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | -98.000000 | 2 | 2 | 1 | 2 | 1 | 1 | 0 | (0.999, 3.0] | 0.02 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | -98.000000 | 2 | 4 | 1 | 7 | 1 | 2 | 0 | (0.999, 3.0] | 0.11 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | 7.000000 | 1 | 3 | 1 | 4 | 1 | 1 | 1 | (3.0, 6.0] | 0.11 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | 120.285714 | 1 | 1 | 3 | 1 | 3 | 4 | 0 | (0.999, 3.0] | 0.11 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | 180.000000 | 4 | 7 | 2 | 15 | 5 | 6 | 0 | (0.999, 3.0] | 0.02 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | -98.000000 | 5 | 7 | 0 | 8 | 3 | 4 | 1 | (3.0, 6.0] | -0.20 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | 139.000000 | 9 | 14 | 6 | 25 | 8 | 11 | 0 | (0.999, 3.0] | -0.20 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | 17.000000 | 2 | 3 | 0 | 3 | 1 | 2 | 0 | (0.999, 3.0] | 0.02 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | -98.000000 | 2 | 5 | 1 | 8 | 2 | 6 | 0 | (0.999, 3.0] | 0.02 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | 164.000000 | 3 | 3 | 1 | 6 | 2 | 3 | 0 | (0.999, 3.0] | -0.09 |
nan_check = test_df_WOE_AP008['AP008_WOE'].isna()
nan_values = test_df_WOE_AP008['AP008_WOE'][nan_check]
nan_values
Series([], Name: AP008_WOE, dtype: float64)
nan_check = train_df_WOE_AP008['AP008_WOE'].isna()
nan_values = train_df_WOE_AP008['AP008_WOE'][nan_check]
nan_values
Series([], Name: AP008_WOE, dtype: float64)
k = WOE('CR009')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
| CR009 | Count | Good | Bad | Good % | Bad % | CR009_WOE | |
|---|---|---|---|---|---|---|---|
| 3296 | 12050 | 4 | 0 | 4 | 0.0 | 0.01 | -inf |
| 3741 | 13257 | 4 | 0 | 4 | 0.0 | 0.01 | -inf |
| 17901 | 88500 | 5 | 0 | 5 | 0.0 | 0.01 | -inf |
| 3695 | 13125 | 3 | 0 | 3 | 0.0 | 0.01 | -inf |
| 8695 | 27288 | 3 | 0 | 3 | 0.0 | 0.01 | -inf |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 21847 | 1214822 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 21848 | 1238000 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 21849 | 1243934 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 21851 | 1381000 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 21852 | 1420300 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
21853 rows × 7 columns
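The `-inf` and `NaN` values above appear wherever a CR009 level has zero goods or zero bads, so `log(Good% / Bad%)` blows up. The notebook handles this by binning CR009 below; an alternative (shown here only as a sketch, with a hypothetical `smoothed_woe` helper and toy counts, not the notebook's `WOE` function) is additive smoothing of the counts before taking the log:

```python
import numpy as np
import pandas as pd

def smoothed_woe(tbl, eps=0.5):
    """WOE with additive (Laplace) smoothing.

    `tbl` is assumed to hold integer 'Good' and 'Bad' counts, one row
    per category; eps keeps both shares strictly positive, so the log
    never produces -inf/NaN even for zero-count cells.
    """
    good_pct = (tbl['Good'] + eps) / (tbl['Good'].sum() + eps * len(tbl))
    bad_pct = (tbl['Bad'] + eps) / (tbl['Bad'].sum() + eps * len(tbl))
    return np.log(good_pct / bad_pct)

# Toy demo: a category with zero 'Good' no longer yields -inf
demo = pd.DataFrame({'Good': [0, 100], 'Bad': [4, 300]})
woe = smoothed_woe(demo)
```

Binning remains the more common fix for credit scorecards, since it also regularizes the sparse tail categories.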
#Bin the train data
train_df['CR009_bin'] = pd.qcut(train_df['CR009'],5,duplicates='drop').values.add_categories("NoData")
train_df['CR009_bin'] = train_df['CR009_bin'].fillna("NoData").astype(str)
train_df['CR009_bin'].value_counts(dropna=False)
(24221.8, 50000.0]      13072
(-0.001, 2500.0]        13072
(11484.4, 24221.8]      12800
(50000.0, 1420300.0]    12528
(2500.0, 11484.4]       12528
Name: CR009_bin, dtype: int64
k = WOE('CR009_bin')
k
| CR009_bin | Count | Good | Bad | Good % | Bad % | CR009_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 4 | (50000.0, 1420300.0] | 12528 | 2158 | 10370 | 17.49 | 20.07 | -0.14 |
| 0 | (-0.001, 2500.0] | 13072 | 2338 | 10734 | 18.95 | 20.78 | -0.09 |
| 1 | (11484.4, 24221.8] | 12800 | 2615 | 10185 | 21.19 | 19.71 | 0.07 |
| 2 | (24221.8, 50000.0] | 13072 | 2658 | 10414 | 21.54 | 20.16 | 0.07 |
| 3 | (2500.0, 11484.4] | 12528 | 2569 | 9959 | 20.82 | 19.28 | 0.08 |
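A WOE table like the one above also gives the information value (IV), a common screen for the variable selection the assignment allows. IV sums `(Good% - Bad%) * WOE` over bins, with the shares as fractions. A sketch, assuming percentage columns shaped like the table's `Good %` / `Bad %` (the `information_value` helper is hypothetical, not part of this notebook):

```python
import numpy as np
import pandas as pd

def information_value(tbl, good_pct='Good %', bad_pct='Bad %'):
    """IV = sum((good share - bad share) * WOE) over bins.

    Shares are assumed to be percentages (0-100), as printed in the
    WOE tables above, so they are rescaled to fractions first.
    """
    g = tbl[good_pct] / 100.0
    b = tbl[bad_pct] / 100.0
    woe = np.log(g / b)
    return float(((g - b) * woe).sum())

# Demo using the CR009_bin shares from the table above
cr009 = pd.DataFrame({'Good %': [17.49, 18.95, 21.19, 21.54, 20.82],
                      'Bad %':  [20.07, 20.78, 19.71, 20.16, 19.28]})
iv = information_value(cr009)
```

By the usual rule of thumb (IV below roughly 0.02 is a weak predictor), CR009_bin on its own carries little signal, which is worth knowing when deciding which variables to keep.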
#Append the WOE value of each category back to the original train data
train_df_WOE_CR009 = pd.merge(train_df, k[['CR009_bin', 'CR009_bin_WOE']],
left_on='CR009_bin',
right_on='CR009_bin', how='left')
train_df_WOE_CR009.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | PA029 | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | CR009_bin | CR009_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | -98.0 | 5 | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | 0.07 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | 17.5 | 2 | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | -0.09 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | -98.0 | 2 | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | 0.07 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | -98.0 | 5 | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | 0.07 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | -98.0 | 2 | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | -0.14 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | -98.0 | 5 | 11 | 3 | 11 | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | 0.07 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | -98.0 | 3 | 4 | 1 | 6 | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | -0.09 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | -98.0 | 4 | 4 | 1 | 6 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | -0.09 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | 96.0 | 4 | 9 | 1 | 10 | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | 0.07 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | 52.0 | 2 | 3 | 3 | 5 | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | -0.14 |
nan_check = train_df_WOE_CR009['CR009_bin_WOE'].isna()
nan_values = train_df_WOE_CR009['CR009_bin_WOE'][nan_check]
nan_values
Series([], Name: CR009_bin_WOE, dtype: float64)
# Define the desired bin labels. They must be listed in ascending interval
# order, because qcut assigns labels positionally to its ordered quantile bins
bin_labels = ["(-0.001, 2500.0]", "(2500.0, 11484.4]", "(11484.4, 24221.8]",
              "(24221.8, 50000.0]", "(50000.0, 1420300.0]"]
# Bin the test data (note: the edges are re-derived from the test quantiles,
# so they only approximate the train edges)
test_df['CR009_bin_labels'] = pd.qcut(test_df['CR009'], 5, duplicates='drop', labels=False)
# Attach the train-style interval labels
test_df['CR009_bin'] = pd.qcut(test_df['CR009'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['CR009_bin'] = test_df['CR009_bin'].fillna("NoData")
# Print the value counts
test_df['CR009_bin'].value_counts(dropna=False)
(24221.8, 50000.0]      3265
(50000.0, 1420300.0]    3209
(11484.4, 24221.8]      3207
(2500.0, 11484.4]       3176
(-0.001, 2500.0]        3143
Name: CR009_bin, dtype: int64
test_df_WOE_CR009 = pd.merge(test_df, k[['CR009_bin', 'CR009_bin_WOE']],
left_on='CR009_bin',
right_on='CR009_bin', how='left')
test_df_WOE_CR009.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin_labels | AP003_bin | CR009_bin_labels | CR009_bin | CR009_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 2 | 1 | 2 | 1 | 1 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | -0.09 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 4 | 1 | 7 | 1 | 2 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | -0.14 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 3 | 1 | 4 | 1 | 1 | 1 | (3.0, 6.0] | 3 | (50000.0, 1420300.0] | -0.14 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 1 | 3 | 1 | 3 | 4 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | -0.09 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 7 | 2 | 15 | 5 | 6 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | -0.14 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | 7 | 0 | 8 | 3 | 4 | 1 | (3.0, 6.0] | 1 | (-0.001, 2500.0] | -0.09 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | 14 | 6 | 25 | 8 | 11 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | -0.09 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | 3 | 0 | 3 | 1 | 2 | 0 | (0.999, 3.0] | 0 | (24221.8, 50000.0] | 0.07 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | 5 | 1 | 8 | 2 | 6 | 0 | (0.999, 3.0] | 2 | (11484.4, 24221.8] | 0.07 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | 3 | 1 | 6 | 2 | 3 | 0 | (0.999, 3.0] | 4 | (2500.0, 11484.4] | 0.08 |
10 rows × 22 columns
nan_check = test_df_WOE_CR009['CR009_bin_WOE'].isna()
nan_values = test_df_WOE_CR009['CR009_bin_WOE'][nan_check]
nan_values
Series([], Name: CR009_bin_WOE, dtype: float64)
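Calling `qcut` again on the test set re-derives the quantile edges from the test data, so the test bins only approximate the train bins. A more robust pattern is to capture the train edges with `retbins=True` and reuse them on the test column via `pd.cut`. A sketch with hypothetical data (not the notebook's CR009 column):

```python
import pandas as pd

# Learn quantile edges on the training column only
train_vals = pd.Series([0, 1500, 5000, 12000, 30000, 60000, 90000, 200000])
test_vals = pd.Series([800, 26000, 999999])  # includes an out-of-range value

train_bins, edges = pd.qcut(train_vals, 4, retbins=True, duplicates='drop')

# Reuse the train edges on the test set; values outside the edge range
# become NaN, which we map to an explicit "NoData" bucket
test_bins = pd.cut(test_vals, bins=edges, include_lowest=True)
test_bins = test_bins.cat.add_categories('NoData').fillna('NoData').astype(str)
```

This guarantees the test bin labels match the train WOE table exactly, so the left merge cannot silently pair a row with the wrong bin's WOE.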
k = WOE('CR015')
k
| CR015 | Count | Good | Bad | Good % | Bad % | CR015_WOE | |
|---|---|---|---|---|---|---|---|
| 4 | 6 | 21562 | 3337 | 18225 | 27.05 | 35.28 | -0.27 |
| 0 | 2 | 2676 | 503 | 2173 | 4.08 | 4.21 | -0.03 |
| 3 | 5 | 27500 | 5641 | 21859 | 45.72 | 42.31 | 0.08 |
| 2 | 4 | 5870 | 1278 | 4592 | 10.36 | 8.89 | 0.15 |
| 1 | 3 | 6392 | 1579 | 4813 | 12.80 | 9.32 | 0.32 |
#Bin the train data
train_df['CR015_bin'] = pd.qcut(train_df['CR015'],5,duplicates='drop').values.add_categories("NoData")
train_df['CR015_bin'] = train_df['CR015_bin'].fillna("NoData").astype(str)
train_df['CR015_bin'].value_counts(dropna=False)
(4.0, 5.0]      27500
(5.0, 6.0]      21562
(1.999, 4.0]    14938
Name: CR015_bin, dtype: int64
k = WOE('CR015_bin')
k
| CR015_bin | Count | Good | Bad | Good % | Bad % | CR015_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 2 | (5.0, 6.0] | 21562 | 3337 | 18225 | 27.05 | 35.28 | -0.27 |
| 1 | (4.0, 5.0] | 27500 | 5641 | 21859 | 45.72 | 42.31 | 0.08 |
| 0 | (1.999, 4.0] | 14938 | 3360 | 11578 | 27.23 | 22.41 | 0.19 |
#Append the WOE value of each category back to the original train data
train_df_WOE_CR015 = pd.merge(train_df,k[['CR015_bin','CR015_bin_WOE']],
left_on='CR015_bin',
right_on='CR015_bin',how='left')
train_df_WOE_CR015.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | CR009_bin | CR015_bin | CR015_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 5 | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | 0.08 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 2 | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | -0.27 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 2 | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | 0.08 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 5 | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | -0.27 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 2 | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | -0.27 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 5 | 11 | 3 | 11 | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | 0.08 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 3 | 4 | 1 | 6 | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | -0.27 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 4 | 4 | 1 | 6 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | 0.19 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 4 | 9 | 1 | 10 | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | 0.08 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 2 | 3 | 3 | 5 | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | 0.08 |
10 rows × 21 columns
nan_check = train_df_WOE_CR015['CR015_bin_WOE'].isna()
nan_values = train_df_WOE_CR015['CR015_bin_WOE'][nan_check]
nan_values
Series([], Name: CR015_bin_WOE, dtype: float64)
# Define the desired bin labels. They must be listed in ascending interval
# order, because qcut assigns labels positionally to its ordered quantile bins
bin_labels = ["(1.999, 4.0]", "(4.0, 5.0]", "(5.0, 6.0]"]
# Bin the test data (note: the edges are re-derived from the test quantiles,
# so they only approximate the train edges)
test_df['CR015_bin_labels'] = pd.qcut(test_df['CR015'], 5, duplicates='drop', labels=False)
# Attach the train-style interval labels
test_df['CR015_bin'] = pd.qcut(test_df['CR015'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['CR015_bin'] = test_df['CR015_bin'].fillna("NoData")
# Print the value counts
test_df['CR015_bin'].value_counts(dropna=False)
(5.0, 6.0]      6839
(1.999, 4.0]    5565
(4.0, 5.0]      3596
Name: CR015_bin, dtype: int64
test_df
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD006 | TD009 | TD010 | TD014 | AP003_bin_labels | AP003_bin | CR009_bin_labels | CR009_bin | CR015_bin_labels | CR015_bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47044 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 1 | 2 | 1 | 1 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] |
| 44295 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 1 | 7 | 1 | 2 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] |
| 74783 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 1 | 4 | 1 | 1 | 1 | (3.0, 6.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] |
| 70975 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 3 | 1 | 3 | 4 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] |
| 46645 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 2 | 15 | 5 | 6 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 67666 | 67667 | 0 | 41 | 1 | 5 | 46967 | 6 | 11 | 56.0 | 56.0 | ... | 2 | 4 | 4 | 4 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 2 | (1.999, 4.0] |
| 51146 | 51147 | 0 | 39 | 1 | 2 | 25796 | 6 | 2 | 91.0 | 91.0 | ... | 3 | 14 | 4 | 5 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 2 | (1.999, 4.0] |
| 42494 | 42495 | 1 | 31 | 1 | 2 | 0 | 5 | 3 | -1.0 | -1.0 | ... | 1 | 3 | 1 | 2 | 0 | (0.999, 3.0] | 0 | (24221.8, 50000.0] | 1 | (5.0, 6.0] |
| 52517 | 52518 | 0 | 34 | 1 | 1 | 3600 | 3 | 2 | -1.0 | -1.0 | ... | 1 | 3 | 1 | 2 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 0 | (4.0, 5.0] |
| 7754 | 7755 | 0 | 43 | 3 | 2 | 52000 | 6 | 10 | -1.0 | -1.0 | ... | 1 | 10 | 3 | 5 | 0 | (0.999, 3.0] | 4 | (2500.0, 11484.4] | 2 | (1.999, 4.0] |
16000 rows × 23 columns
#Append the WOE table to the test data
test_df_WOE_CR015 = pd.merge(test_df,k[['CR015_bin','CR015_bin_WOE']],
left_on='CR015_bin',
right_on='CR015_bin',how='left')
test_df_WOE_CR015.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD009 | TD010 | TD014 | AP003_bin_labels | AP003_bin | CR009_bin_labels | CR009_bin | CR015_bin_labels | CR015_bin | CR015_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 2 | 1 | 1 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | -0.27 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 7 | 1 | 2 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | -0.27 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 4 | 1 | 1 | 1 | (3.0, 6.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | -0.27 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 1 | 3 | 4 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | -0.27 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 15 | 5 | 6 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | -0.27 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | 8 | 3 | 4 | 1 | (3.0, 6.0] | 1 | (-0.001, 2500.0] | 2 | (1.999, 4.0] | 0.19 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | 25 | 8 | 11 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 0 | (4.0, 5.0] | 0.08 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | 3 | 1 | 2 | 0 | (0.999, 3.0] | 0 | (24221.8, 50000.0] | 1 | (5.0, 6.0] | -0.27 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | 8 | 2 | 6 | 0 | (0.999, 3.0] | 2 | (11484.4, 24221.8] | 1 | (5.0, 6.0] | -0.27 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | 6 | 2 | 3 | 0 | (0.999, 3.0] | 4 | (2500.0, 11484.4] | 1 | (5.0, 6.0] | -0.27 |
10 rows × 24 columns
nan_check = test_df_WOE_CR015['CR015_bin_WOE'].isna()
nan_values = test_df_WOE_CR015['CR015_bin_WOE'][nan_check]
nan_values
Series([], Name: CR015_bin_WOE, dtype: float64)
k = WOE('CR019')
k
| CR019 | Count | Good | Bad | Good % | Bad % | CR019_WOE | |
|---|---|---|---|---|---|---|---|
| 11 | 12 | 3499 | 564 | 2935 | 4.57 | 5.68 | -0.22 |
| 10 | 11 | 10678 | 1753 | 8925 | 14.21 | 17.28 | -0.20 |
| 8 | 9 | 2318 | 388 | 1930 | 3.14 | 3.74 | -0.17 |
| 5 | 6 | 4136 | 744 | 3392 | 6.03 | 6.57 | -0.09 |
| 9 | 10 | 1808 | 332 | 1476 | 2.69 | 2.86 | -0.06 |
| 7 | 8 | 2615 | 484 | 2131 | 3.92 | 4.12 | -0.05 |
| 6 | 7 | 5150 | 982 | 4168 | 7.96 | 8.07 | -0.01 |
| 4 | 5 | 7699 | 1513 | 6186 | 12.26 | 11.97 | 0.02 |
| 0 | 1 | 872 | 182 | 690 | 1.48 | 1.34 | 0.10 |
| 2 | 3 | 10654 | 2263 | 8391 | 18.34 | 16.24 | 0.12 |
| 3 | 4 | 7761 | 1662 | 6099 | 13.47 | 11.81 | 0.13 |
| 1 | 2 | 6810 | 1471 | 5339 | 11.92 | 10.33 | 0.14 |
#Append the WOE value of each category back to the original train data
train_df_WOE_CR019 = pd.merge(train_df,k[['CR019','CR019_WOE']],
left_on='CR019',
right_on='CR019',how='left')
train_df_WOE_CR019.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD001 | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | CR009_bin | CR015_bin | CR019_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 5 | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | 0.02 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 2 | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | -0.22 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 2 | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | -0.22 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 5 | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | 0.02 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 2 | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | -0.01 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 5 | 11 | 3 | 11 | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | 0.13 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 3 | 4 | 1 | 6 | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | 0.12 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 4 | 4 | 1 | 6 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | 0.02 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 4 | 9 | 1 | 10 | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | 0.02 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 2 | 3 | 3 | 5 | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | -0.01 |
10 rows × 21 columns
#Append the WOE table to the test data
test_df_WOE_CR019 = pd.merge(test_df,k[['CR019','CR019_WOE']],
left_on='CR019',
right_on='CR019',how='left')
test_df_WOE_CR019.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD009 | TD010 | TD014 | AP003_bin_labels | AP003_bin | CR009_bin_labels | CR009_bin | CR015_bin_labels | CR015_bin | CR019_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 2 | 1 | 1 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 0.02 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 7 | 1 | 2 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 0.02 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 4 | 1 | 1 | 1 | (3.0, 6.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | -0.20 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 1 | 3 | 4 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 0.12 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 15 | 5 | 6 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | -0.20 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | 8 | 3 | 4 | 1 | (3.0, 6.0] | 1 | (-0.001, 2500.0] | 2 | (1.999, 4.0] | -0.20 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | 25 | 8 | 11 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 0 | (4.0, 5.0] | 0.12 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | 3 | 1 | 2 | 0 | (0.999, 3.0] | 0 | (24221.8, 50000.0] | 1 | (5.0, 6.0] | 0.12 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | 8 | 2 | 6 | 0 | (0.999, 3.0] | 2 | (11484.4, 24221.8] | 1 | (5.0, 6.0] | -0.05 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | 6 | 2 | 3 | 0 | (0.999, 3.0] | 4 | (2500.0, 11484.4] | 1 | (5.0, 6.0] | -0.06 |
10 rows × 24 columns
nan_check = test_df_WOE_CR019['CR019_WOE'].isna()
nan_values = test_df_WOE_CR019['CR019_WOE'][nan_check]
nan_values
Series([], Name: CR019_WOE, dtype: float64)
nan_check= train_df_WOE_CR019['CR019_WOE'].isna()
nan_values = train_df_WOE_CR019['CR019_WOE'][nan_check]
nan_values
Series([], Name: CR019_WOE, dtype: float64)
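CR019 is merged on its raw values rather than bins, so any test value never seen in training would be left with a NaN WOE after the left join. The empty checks above confirm that did not happen in this split, but a defensive fallback (an assumption for illustration, not part of this notebook) is to fill unseen values with 0, the neutral "no evidence" WOE:

```python
import pandas as pd

# Hypothetical WOE lookup learned on train: only codes 1-3 were observed
woe_table = pd.DataFrame({'CR019': [1, 2, 3],
                          'CR019_WOE': [0.10, 0.14, 0.12]})
test = pd.DataFrame({'CR019': [2, 3, 99]})  # code 99 was never seen in train

merged = test.merge(woe_table, on='CR019', how='left')
# Unseen codes get WOE = 0 (log-odds equal to the population average)
merged['CR019_WOE'] = merged['CR019_WOE'].fillna(0.0)
```

With this in place the downstream models never receive NaNs even if scoring data drifts to new CR019 codes.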
k = WOE('TD001')
k
| TD001 | Count | Good | Bad | Good % | Bad % | TD001_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 15698 | 2455 | 13243 | 19.90 | 25.63 | -0.25 |
| 1 | 1 | 10707 | 1723 | 8984 | 13.96 | 17.39 | -0.22 |
| 16 | 16 | 6 | 1 | 5 | 0.01 | 0.01 | 0.00 |
| 2 | 2 | 17835 | 3487 | 14348 | 28.26 | 27.77 | 0.02 |
| 3 | 3 | 9755 | 2069 | 7686 | 16.77 | 14.88 | 0.12 |
| 4 | 4 | 4891 | 1163 | 3728 | 9.43 | 7.22 | 0.27 |
| 5 | 5 | 2313 | 614 | 1699 | 4.98 | 3.29 | 0.41 |
| 10 | 10 | 112 | 29 | 83 | 0.24 | 0.16 | 0.41 |
| 6 | 6 | 1267 | 350 | 917 | 2.84 | 1.77 | 0.47 |
| 7 | 7 | 712 | 199 | 513 | 1.61 | 0.99 | 0.49 |
| 12 | 12 | 36 | 11 | 25 | 0.09 | 0.05 | 0.59 |
| 9 | 9 | 189 | 61 | 128 | 0.49 | 0.25 | 0.67 |
| 8 | 8 | 364 | 126 | 238 | 1.02 | 0.46 | 0.80 |
| 11 | 11 | 65 | 25 | 40 | 0.20 | 0.08 | 0.92 |
| 15 | 15 | 8 | 4 | 4 | 0.03 | 0.01 | 1.10 |
| 13 | 13 | 22 | 10 | 12 | 0.08 | 0.02 | 1.39 |
| 14 | 14 | 12 | 6 | 6 | 0.05 | 0.01 | 1.61 |
| 19 | 19 | 4 | 3 | 1 | 0.02 | 0.00 | inf |
| 18 | 18 | 2 | 1 | 1 | 0.01 | 0.00 | inf |
| 20 | 20 | 1 | 1 | 0 | 0.01 | 0.00 | inf |
| 17 | 17 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
#Bin the train data
train_df['TD001_bin'] = pd.qcut(train_df['TD001'],5,duplicates='drop').values.add_categories("NoData")
train_df['TD001_bin'] = train_df['TD001_bin'].fillna("NoData").astype(str)
train_df['TD001_bin'].value_counts(dropna=False)
(-0.001, 1.0]    26405
(1.0, 2.0]       17835
(3.0, 20.0]      10005
(2.0, 3.0]        9755
Name: TD001_bin, dtype: int64
k = WOE('TD001_bin')
k
| TD001_bin | Count | Good | Bad | Good % | Bad % | TD001_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | (-0.001, 1.0] | 26405 | 4178 | 22227 | 33.86 | 43.02 | -0.24 |
| 1 | (1.0, 2.0] | 17835 | 3487 | 14348 | 28.26 | 27.77 | 0.02 |
| 2 | (2.0, 3.0] | 9755 | 2069 | 7686 | 16.77 | 14.88 | 0.12 |
| 3 | (3.0, 20.0] | 10005 | 2604 | 7401 | 21.11 | 14.33 | 0.39 |
#Append the WOE value of each category back to the original train data
train_df_WOE_TD001 = pd.merge(train_df,k[['TD001_bin','TD001_bin_WOE']],
left_on='TD001_bin',
right_on='TD001_bin',how='left')
train_df_WOE_TD001.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD001_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | 0.39 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | 0.02 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | 0.02 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | 0.39 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | 0.02 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 11 | 3 | 11 | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | 0.39 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 4 | 1 | 6 | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | 0.12 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 4 | 1 | 6 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | 0.39 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 9 | 1 | 10 | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | 0.39 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 3 | 3 | 5 | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | 0.02 |
10 rows × 22 columns
nan_check= train_df_WOE_TD001['TD001_bin_WOE'].isna()
nan_values = train_df_WOE_TD001['TD001_bin_WOE'][nan_check]
nan_values
Series([], Name: TD001_bin_WOE, dtype: float64)
# Define the desired bin labels. They must be listed in ascending interval
# order, because qcut assigns labels positionally to its ordered quantile bins
bin_labels = ["(-0.001, 1.0]", "(1.0, 2.0]", "(2.0, 3.0]", "(3.0, 20.0]"]
# Bin the test data (note: the edges are re-derived from the test quantiles,
# so they only approximate the train edges)
test_df['TD001_bin_labels'] = pd.qcut(test_df['TD001'], 5, duplicates='drop', labels=False)
# Attach the train-style interval labels
test_df['TD001_bin'] = pd.qcut(test_df['TD001'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD001_bin'] = test_df['TD001_bin'].fillna("NoData")
# Print the value counts
test_df['TD001_bin'].value_counts(dropna=False)
(-0.001, 1.0]    6635
(1.0, 2.0]       4364
(2.0, 3.0]       2570
(3.0, 20.0]      2431
Name: TD001_bin, dtype: int64
test_df_WOE_TD001 = pd.merge(test_df, k[['TD001_bin', 'TD001_bin_WOE']],
left_on='TD001_bin',
right_on='TD001_bin', how='left')
test_df_WOE_TD001.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD014 | AP003_bin_labels | AP003_bin | CR009_bin_labels | CR009_bin | CR015_bin_labels | CR015_bin | TD001_bin_labels | TD001_bin | TD001_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 1 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0.02 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 2 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0.02 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 1 | 1 | (3.0, 6.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | -0.24 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 4 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | -0.24 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 6 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 3 | (2.0, 3.0] | 0.12 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | 4 | 1 | (3.0, 6.0] | 1 | (-0.001, 2500.0] | 2 | (1.999, 4.0] | 3 | (2.0, 3.0] | 0.12 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | 11 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 0 | (4.0, 5.0] | 3 | (2.0, 3.0] | 0.12 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | 2 | 0 | (0.999, 3.0] | 0 | (24221.8, 50000.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0.02 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | 6 | 0 | (0.999, 3.0] | 2 | (11484.4, 24221.8] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0.02 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | 3 | 0 | (0.999, 3.0] | 4 | (2500.0, 11484.4] | 1 | (5.0, 6.0] | 2 | (3.0, 20.0] | 0.39 |
10 rows × 26 columns
nan_check = test_df_WOE_TD001['TD001_bin_WOE'].isna()
nan_values = test_df_WOE_TD001['TD001_bin_WOE'][nan_check]
nan_values
Series([], Name: TD001_bin_WOE, dtype: float64)
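The merge-then-NaN-check pattern repeats for every variable above. It could be collapsed into one helper; a sketch with a hypothetical `append_woe` function and toy data, assuming (as the notebook's `WOE` output does) that the lookup table has a `<var>` column plus a `<var>_WOE` column:

```python
import pandas as pd

def append_woe(df, woe_tbl, var):
    """Left-join the <var>_WOE column onto df and verify no NaNs remain."""
    out = df.merge(woe_tbl[[var, f'{var}_WOE']], on=var, how='left')
    n_missing = int(out[f'{var}_WOE'].isna().sum())
    if n_missing:
        raise ValueError(f'{n_missing} rows of {var} have no WOE mapping')
    return out

# Toy demo standing in for the train/test merges above
tbl = pd.DataFrame({'TD001_bin': ['a', 'b'], 'TD001_bin_WOE': [0.1, -0.2]})
df = pd.DataFrame({'TD001_bin': ['a', 'a', 'b']})
res = append_woe(df, tbl, 'TD001_bin')
```

Raising on a failed mapping is stricter than the manual empty-Series inspection, which is easy to overlook in a long notebook.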
k = WOE('TD005')
k
| TD005 | Count | Good | Bad | Good % | Bad % | TD005_WOE | |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 6735 | 844 | 5891 | 6.84 | 11.40 | -0.51 |
| 0 | 0 | 6157 | 821 | 5336 | 6.65 | 10.33 | -0.44 |
| 2 | 2 | 13559 | 2188 | 11371 | 17.73 | 22.01 | -0.22 |
| 3 | 3 | 10995 | 2076 | 8919 | 16.83 | 17.26 | -0.03 |
| 23 | 23 | 4 | 1 | 3 | 0.01 | 0.01 | 0.00 |
| 24 | 24 | 4 | 1 | 3 | 0.01 | 0.01 | 0.00 |
| 4 | 4 | 8174 | 1633 | 6541 | 13.24 | 12.66 | 0.04 |
| 5 | 5 | 5779 | 1340 | 4439 | 10.86 | 8.59 | 0.23 |
| 6 | 6 | 4081 | 991 | 3090 | 8.03 | 5.98 | 0.29 |
| 7 | 7 | 2835 | 739 | 2096 | 5.99 | 4.06 | 0.39 |
| 8 | 8 | 1928 | 510 | 1418 | 4.13 | 2.74 | 0.41 |
| 18 | 18 | 31 | 9 | 22 | 0.07 | 0.04 | 0.56 |
| 11 | 11 | 566 | 170 | 396 | 1.38 | 0.77 | 0.58 |
| 9 | 9 | 1285 | 386 | 899 | 3.13 | 1.74 | 0.59 |
| 10 | 10 | 785 | 244 | 541 | 1.98 | 1.05 | 0.63 |
| 13 | 13 | 254 | 81 | 173 | 0.66 | 0.33 | 0.69 |
| 22 | 22 | 9 | 3 | 6 | 0.02 | 0.01 | 0.69 |
| 21 | 21 | 10 | 3 | 7 | 0.02 | 0.01 | 0.69 |
| 20 | 20 | 16 | 5 | 11 | 0.04 | 0.02 | 0.69 |
| 12 | 12 | 348 | 115 | 233 | 0.93 | 0.45 | 0.73 |
| 16 | 16 | 77 | 29 | 48 | 0.24 | 0.09 | 0.98 |
| 19 | 19 | 24 | 10 | 14 | 0.08 | 0.03 | 0.98 |
| 15 | 15 | 110 | 43 | 67 | 0.35 | 0.13 | 0.99 |
| 17 | 17 | 59 | 24 | 35 | 0.19 | 0.07 | 1.00 |
| 14 | 14 | 175 | 72 | 103 | 0.58 | 0.20 | 1.06 |
#Append the WOE value of each category back to the original train data
train_df_WOE_TD005 = pd.merge(train_df,k[['TD005','TD005_WOE']],
left_on='TD005',
right_on='TD005',how='left')
train_df_WOE_TD005.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD005 | TD006 | TD009 | TD010 | TD014 | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD005_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 8 | 3 | 14 | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | 0.41 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 2 | 0 | 2 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | -0.22 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 3 | 1 | 6 | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | -0.03 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 9 | 3 | 9 | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | 0.59 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 2 | 0 | 2 | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | -0.22 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 11 | 3 | 11 | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | 0.58 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 4 | 1 | 6 | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | 0.04 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 4 | 1 | 6 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | 0.04 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 9 | 1 | 10 | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | 0.59 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 3 | 3 | 5 | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | -0.03 |
10 rows × 22 columns
#Append the WOE table to the test data
test_df_WOE_TD005 = pd.merge(test_df,k[['TD005','TD005_WOE']],
left_on='TD005',
right_on='TD005',how='left')
test_df_WOE_TD005.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD014 | AP003_bin_labels | AP003_bin | CR009_bin_labels | CR009_bin | CR015_bin_labels | CR015_bin | TD001_bin_labels | TD001_bin | TD005_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 1 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | -0.22 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 2 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0.04 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 1 | 1 | (3.0, 6.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | -0.03 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 4 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | -0.51 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 6 | 0 | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 3 | (2.0, 3.0] | 0.39 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | 4 | 1 | (3.0, 6.0] | 1 | (-0.001, 2500.0] | 2 | (1.999, 4.0] | 3 | (2.0, 3.0] | 0.39 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | 11 | 0 | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 0 | (4.0, 5.0] | 3 | (2.0, 3.0] | 1.06 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | 2 | 0 | (0.999, 3.0] | 0 | (24221.8, 50000.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | -0.03 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | 6 | 0 | (0.999, 3.0] | 2 | (11484.4, 24221.8] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0.23 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | 3 | 0 | (0.999, 3.0] | 4 | (2500.0, 11484.4] | 1 | (5.0, 6.0] | 2 | (3.0, 20.0] | -0.03 |
10 rows × 26 columns
nan_check = test_df_WOE_TD005['TD005_WOE'].isna()
nan_values = test_df_WOE_TD005['TD005_WOE'][nan_check]
nan_values
Series([], Name: TD005_WOE, dtype: float64)
nan_check = train_df_WOE_TD005['TD005_WOE'].isna()
nan_values = train_df_WOE_TD005['TD005_WOE'][nan_check]
nan_values
Series([], Name: TD005_WOE, dtype: float64)
k = WOE('TD006')
k
| TD006 | Count | Good | Bad | Good % | Bad % | TD006_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 18701 | 3135 | 15566 | 25.41 | 30.13 | -0.17 |
| 1 | 1 | 23081 | 4027 | 19054 | 32.64 | 36.88 | -0.12 |
| 14 | 14 | 6 | 1 | 5 | 0.01 | 0.01 | 0.00 |
| 13 | 13 | 12 | 3 | 9 | 0.02 | 0.02 | 0.00 |
| 2 | 2 | 12417 | 2610 | 9807 | 21.15 | 18.98 | 0.11 |
| 3 | 3 | 5461 | 1299 | 4162 | 10.53 | 8.06 | 0.27 |
| 4 | 4 | 2292 | 624 | 1668 | 5.06 | 3.23 | 0.45 |
| 5 | 5 | 1014 | 295 | 719 | 2.39 | 1.39 | 0.54 |
| 7 | 7 | 242 | 72 | 170 | 0.58 | 0.33 | 0.56 |
| 6 | 6 | 464 | 151 | 313 | 1.22 | 0.61 | 0.69 |
| 10 | 10 | 47 | 16 | 31 | 0.13 | 0.06 | 0.77 |
| 8 | 8 | 127 | 44 | 83 | 0.36 | 0.16 | 0.81 |
| 11 | 11 | 26 | 11 | 15 | 0.09 | 0.03 | 1.10 |
| 9 | 9 | 86 | 40 | 46 | 0.32 | 0.09 | 1.27 |
| 12 | 12 | 13 | 6 | 7 | 0.05 | 0.01 | 1.61 |
| 18 | 18 | 3 | 1 | 2 | 0.01 | 0.00 | inf |
| 17 | 17 | 4 | 2 | 2 | 0.02 | 0.00 | inf |
| 20 | 21 | 1 | 1 | 0 | 0.01 | 0.00 | inf |
| 15 | 15 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 16 | 16 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 19 | 20 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
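The raw WOE table above contains `inf` and `NaN` entries for sparse categories (levels with zero goods or zero bads), which would break any downstream merge or model. One common remedy is additive (Laplace) smoothing of the event/non-event shares. A minimal sketch, assuming a frame with a binary target such as `loan_default`; since the `WOE()` helper's internals are not shown here, this is an illustrative alternative rather than a drop-in replacement:

```python
import numpy as np
import pandas as pd

def smoothed_woe(df, col, target, alpha=0.5):
    """WOE per category with additive (Laplace) smoothing so that
    categories with zero events or zero non-events never yield +/-inf."""
    g = df.groupby(col)[target].agg(['sum', 'count'])
    g['event'] = g['sum']                    # rows where target == 1
    g['non_event'] = g['count'] - g['sum']   # rows where target == 0
    n_cat = len(g)
    # add alpha to every cell and alpha * n_cat to every total
    event_rate = (g['event'] + alpha) / (g['event'].sum() + alpha * n_cat)
    non_event_rate = (g['non_event'] + alpha) / (g['non_event'].sum() + alpha * n_cat)
    g[col + '_WOE'] = np.log(event_rate / non_event_rate).round(2)
    return g[['count', 'event', 'non_event', col + '_WOE']].reset_index()
```

With smoothing, even a level seen only once gets a finite (shrunken-toward-zero) WOE instead of `inf`.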
#Bin the train data
train_df['TD006_bin'] = pd.qcut(train_df['TD006'], 5, duplicates='drop').values.add_categories("NoData")
train_df['TD006_bin'] = train_df['TD006_bin'].fillna("NoData").astype(str)
train_df['TD006_bin'].value_counts(dropna=False)
(-0.001, 1.0]    41782
(1.0, 2.0]       12417
(2.0, 21.0]       9801
Name: TD006_bin, dtype: int64
k = WOE('TD006_bin')
k
| TD006_bin | Count | Good | Bad | Good % | Bad % | TD006_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | (-0.001, 1.0] | 41782 | 7162 | 34620 | 58.05 | 67.01 | -0.14 |
| 1 | (1.0, 2.0] | 12417 | 2610 | 9807 | 21.15 | 18.98 | 0.11 |
| 2 | (2.0, 21.0] | 9801 | 2566 | 7235 | 20.80 | 14.00 | 0.40 |
#Append the WOE value of each category back to the original train data
train_df_WOE_TD006 = pd.merge(train_df,k[['TD006_bin','TD006_bin_WOE']],
left_on='TD006_bin',
right_on='TD006_bin',how='left')
train_df_WOE_TD006.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD006 | TD009 | TD010 | TD014 | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD006_bin | TD006_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 3 | 14 | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | 0.40 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 0 | 2 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | -0.14 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 1 | 6 | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | -0.14 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 3 | 9 | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | 0.40 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 0 | 2 | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | -0.14 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 3 | 11 | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | 0.40 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 1 | 6 | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | (-0.001, 1.0] | -0.14 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 1 | 6 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | (-0.001, 1.0] | -0.14 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 1 | 10 | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | -0.14 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 3 | 5 | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | 0.40 |
10 rows × 23 columns
nan_check = train_df_WOE_TD006['TD006_bin_WOE'].isna()
nan_values = train_df_WOE_TD006['TD006_bin_WOE'][nan_check]
nan_values
Series([], Name: TD006_bin_WOE, dtype: float64)
# Define the desired bin labels
bin_labels = ["(-0.001, 1.0]", "(1.0, 2.0]","(2.0, 21.0]"]
# Bin the test data, keeping the integer bin codes (labels=False)
test_df['TD006_bin_labels'] = pd.qcut(test_df['TD006'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD006_bin'] = pd.qcut(test_df['TD006'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD006_bin'] = test_df['TD006_bin'].fillna("NoData")
# Print the value counts
test_df['TD006_bin'].value_counts(dropna=False)
(-0.001, 1.0]    10475
(1.0, 2.0]        3110
(2.0, 21.0]       2415
Name: TD006_bin, dtype: int64
test_df_WOE_TD006 = pd.merge(test_df, k[['TD006_bin', 'TD006_bin_WOE']],
left_on='TD006_bin',
right_on='TD006_bin', how='left')
test_df_WOE_TD006.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | AP003_bin | CR009_bin_labels | CR009_bin | CR015_bin_labels | CR015_bin | TD001_bin_labels | TD001_bin | TD006_bin_labels | TD006_bin | TD006_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | -0.14 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | -0.14 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | (3.0, 6.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | -0.14 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | 2 | (2.0, 21.0] | 0.40 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | (0.999, 3.0] | 3 | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 3 | (2.0, 3.0] | 1 | (1.0, 2.0] | 0.11 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | (3.0, 6.0] | 1 | (-0.001, 2500.0] | 2 | (1.999, 4.0] | 3 | (2.0, 3.0] | 0 | (-0.001, 1.0] | -0.14 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | (0.999, 3.0] | 1 | (-0.001, 2500.0] | 0 | (4.0, 5.0] | 3 | (2.0, 3.0] | 2 | (2.0, 21.0] | 0.40 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | (0.999, 3.0] | 0 | (24221.8, 50000.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | -0.14 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | (0.999, 3.0] | 2 | (11484.4, 24221.8] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | -0.14 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | (0.999, 3.0] | 4 | (2500.0, 11484.4] | 1 | (5.0, 6.0] | 2 | (3.0, 20.0] | 0 | (-0.001, 1.0] | -0.14 |
10 rows × 28 columns
nan_check = test_df_WOE_TD006['TD006_bin_WOE'].isna()
nan_values = test_df_WOE_TD006['TD006_bin_WOE'][nan_check]
nan_values
Series([], Name: TD006_bin_WOE, dtype: float64)
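The cells above re-run `qcut` on the test set and then force the train labels onto whatever quantile edges the test data happens to produce, so the test bins are not guaranteed to coincide with the train bins. A more robust pattern is to capture the train edges with `retbins=True` and apply those same edges to the test set with `pd.cut`. A sketch under the assumption that train and test are plain numeric Series (`fit_apply_bins` is a hypothetical helper name, not part of the original notebook):

```python
import pandas as pd

def fit_apply_bins(train_s, test_s, q=5):
    """Compute quantile bins on the train series and apply the SAME edges
    to the test series, so the train/test bin labels line up for the WOE merge."""
    binned_train, edges = pd.qcut(train_s, q, duplicates='drop', retbins=True)
    # include_lowest=True reproduces qcut's left-open first interval label
    binned_test = pd.cut(test_s, bins=edges, include_lowest=True)
    # test values outside the train range fall in no bin -> NaN -> "NoData"
    return (binned_train.astype(str),
            binned_test.astype(str).replace('nan', 'NoData'))

# hypothetical usage with the notebook's frames:
# train_df['TD006_bin'], test_df['TD006_bin'] = fit_apply_bins(train_df['TD006'], test_df['TD006'])
```

Because both sides are cut on the train edges, every test label is guaranteed to appear in the train-side WOE table (or be "NoData"), so the left merge cannot silently mislabel a bin.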
k = WOE('TD009')
k
| TD009 | Count | Good | Bad | Good % | Bad % | TD009_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2951 | 340 | 2611 | 2.76 | 5.05 | -0.60 |
| 1 | 1 | 3838 | 443 | 3395 | 3.59 | 6.57 | -0.60 |
| 2 | 2 | 9131 | 1255 | 7876 | 10.17 | 15.25 | -0.41 |
| 3 | 3 | 8864 | 1388 | 7476 | 11.25 | 14.47 | -0.25 |
| 4 | 4 | 7712 | 1367 | 6345 | 11.08 | 12.28 | -0.10 |
| 27 | 27 | 13 | 2 | 11 | 0.02 | 0.02 | 0.00 |
| 31 | 31 | 4 | 1 | 3 | 0.01 | 0.01 | 0.00 |
| 5 | 5 | 6316 | 1254 | 5062 | 10.16 | 9.80 | 0.04 |
| 6 | 6 | 5198 | 1083 | 4115 | 8.78 | 7.97 | 0.10 |
| 7 | 7 | 4339 | 959 | 3380 | 7.77 | 6.54 | 0.17 |
| 8 | 8 | 3458 | 818 | 2640 | 6.63 | 5.11 | 0.26 |
| 23 | 23 | 53 | 14 | 39 | 0.11 | 0.08 | 0.32 |
| 9 | 9 | 2941 | 746 | 2195 | 6.05 | 4.25 | 0.35 |
| 16 | 16 | 367 | 95 | 272 | 0.77 | 0.53 | 0.37 |
| 13 | 13 | 1010 | 265 | 745 | 2.15 | 1.44 | 0.40 |
| 10 | 10 | 2280 | 611 | 1669 | 4.95 | 3.23 | 0.43 |
| 11 | 11 | 1711 | 461 | 1250 | 3.74 | 2.42 | 0.44 |
| 18 | 18 | 206 | 57 | 149 | 0.46 | 0.29 | 0.46 |
| 12 | 12 | 1420 | 419 | 1001 | 3.40 | 1.94 | 0.56 |
| 14 | 14 | 803 | 256 | 547 | 2.07 | 1.06 | 0.67 |
| 17 | 17 | 296 | 95 | 201 | 0.77 | 0.39 | 0.68 |
| 29 | 29 | 5 | 2 | 3 | 0.02 | 0.01 | 0.69 |
| 30 | 30 | 7 | 3 | 4 | 0.02 | 0.01 | 0.69 |
| 19 | 19 | 161 | 54 | 107 | 0.44 | 0.21 | 0.74 |
| 15 | 15 | 529 | 183 | 346 | 1.48 | 0.67 | 0.79 |
| 25 | 25 | 42 | 16 | 26 | 0.13 | 0.05 | 0.96 |
| 20 | 20 | 119 | 48 | 71 | 0.39 | 0.14 | 1.02 |
| 26 | 26 | 24 | 11 | 13 | 0.09 | 0.03 | 1.10 |
| 21 | 21 | 91 | 38 | 53 | 0.31 | 0.10 | 1.13 |
| 22 | 22 | 58 | 26 | 32 | 0.21 | 0.06 | 1.25 |
| 24 | 24 | 28 | 15 | 13 | 0.12 | 0.03 | 1.39 |
| 28 | 28 | 14 | 8 | 6 | 0.06 | 0.01 | 1.79 |
| 34 | 34 | 3 | 2 | 1 | 0.02 | 0.00 | inf |
| 32 | 32 | 3 | 2 | 1 | 0.02 | 0.00 | inf |
| 37 | 46 | 1 | 1 | 0 | 0.01 | 0.00 | inf |
| 33 | 33 | 2 | 0 | 2 | 0.00 | 0.00 | NaN |
| 35 | 36 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 36 | 38 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
#Bin the train data
train_df['TD009_bin'] = pd.qcut(train_df['TD009'], 5, duplicates='drop').values.add_categories("NoData")
train_df['TD009_bin'] = train_df['TD009_bin'].fillna("NoData").astype(str)
train_df['TD009_bin'].value_counts(dropna=False)
(2.0, 4.0]       16576
(-0.001, 2.0]    15920
(5.0, 8.0]       12995
(8.0, 46.0]      12193
(4.0, 5.0]        6316
Name: TD009_bin, dtype: int64
k = WOE('TD009_bin')
k
| TD009_bin | Count | Good | Bad | Good % | Bad % | TD009_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | (-0.001, 2.0] | 15920 | 2038 | 13882 | 16.52 | 26.87 | -0.49 |
| 1 | (2.0, 4.0] | 16576 | 2755 | 13821 | 22.33 | 26.75 | -0.18 |
| 2 | (4.0, 5.0] | 6316 | 1254 | 5062 | 10.16 | 9.80 | 0.04 |
| 3 | (5.0, 8.0] | 12995 | 2860 | 10135 | 23.18 | 19.62 | 0.17 |
| 4 | (8.0, 46.0] | 12193 | 3431 | 8762 | 27.81 | 16.96 | 0.49 |
#Append the WOE value of each category back to the original train data
train_df_WOE_TD009 = pd.merge(train_df,k[['TD009_bin','TD009_bin_WOE']],
left_on='TD009_bin',
right_on='TD009_bin',how='left')
train_df_WOE_TD009.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD009 | TD010 | TD014 | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD006_bin | TD009_bin | TD009_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 14 | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | 0.49 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 2 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | -0.49 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 6 | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | (5.0, 8.0] | 0.17 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 9 | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | 0.49 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 2 | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | -0.49 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 11 | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | 0.49 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 6 | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | (-0.001, 1.0] | (5.0, 8.0] | 0.17 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 6 | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | (-0.001, 1.0] | (5.0, 8.0] | 0.17 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 10 | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | (8.0, 46.0] | 0.49 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 5 | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | (4.0, 5.0] | 0.04 |
10 rows × 24 columns
nan_check = train_df_WOE_TD009['TD009_bin_WOE'].isna()
nan_values = train_df_WOE_TD009['TD009_bin_WOE'][nan_check]
nan_values
Series([], Name: TD009_bin_WOE, dtype: float64)
# Define the bin labels in ascending interval order (qcut assigns labels from the lowest bin upward)
bin_labels = ["(-0.001, 2.0]", "(2.0, 4.0]", "(4.0, 5.0]", "(5.0, 8.0]", "(8.0, 46.0]"]
# Bin the test data, keeping the integer bin codes (labels=False)
test_df['TD009_bin_labels'] = pd.qcut(test_df['TD009'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD009_bin'] = pd.qcut(test_df['TD009'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD009_bin'] = test_df['TD009_bin'].fillna("NoData")
# Print the value counts
test_df['TD009_bin'].value_counts(dropna=False)
(-0.001, 2.0]    4146
(2.0, 4.0]       4078
(8.0, 46.0]      3150
(4.0, 5.0]       3067
(5.0, 8.0]       1559
Name: TD009_bin, dtype: int64
#Append the WOE table to the test data
test_df_WOE_TD009 = pd.merge(test_df,k[['TD009_bin','TD009_bin_WOE']],
left_on='TD009_bin',
right_on='TD009_bin',how='left')
test_df_WOE_TD009.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | CR009_bin | CR015_bin_labels | CR015_bin | TD001_bin_labels | TD001_bin | TD006_bin_labels | TD006_bin | TD009_bin_labels | TD009_bin | TD009_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 0 | (2.0, 4.0] | -0.18 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0.49 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | -0.49 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | (-0.001, 2500.0] | 1 | (5.0, 6.0] | 0 | (-0.001, 1.0] | 2 | (2.0, 21.0] | 0 | (2.0, 4.0] | -0.18 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | (50000.0, 1420300.0] | 1 | (5.0, 6.0] | 3 | (2.0, 3.0] | 1 | (1.0, 2.0] | 4 | (4.0, 5.0] | 0.04 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | (-0.001, 2500.0] | 2 | (1.999, 4.0] | 3 | (2.0, 3.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0.49 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | (-0.001, 2500.0] | 0 | (4.0, 5.0] | 3 | (2.0, 3.0] | 2 | (2.0, 21.0] | 4 | (4.0, 5.0] | 0.04 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | (24221.8, 50000.0] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | -0.49 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | (11484.4, 24221.8] | 1 | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0.49 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | (2500.0, 11484.4] | 1 | (5.0, 6.0] | 2 | (3.0, 20.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0.49 |
10 rows × 30 columns
nan_check = test_df_WOE_TD009['TD009_bin_WOE'].isna()
nan_values = test_df_WOE_TD009['TD009_bin_WOE'][nan_check]
nan_values
Series([], Name: TD009_bin_WOE, dtype: float64)
k = WOE('TD010')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
| TD010 | Count | Good | Bad | Good % | Bad % | TD010_WOE | |
|---|---|---|---|---|---|---|---|
| 19 | 19 | 4 | 0 | 4 | 0.00 | 0.01 | -inf |
| 0 | 0 | 12378 | 1879 | 10499 | 15.23 | 20.32 | -0.29 |
| 1 | 1 | 18591 | 3008 | 15583 | 24.38 | 30.16 | -0.21 |
| 2 | 2 | 13916 | 2689 | 11227 | 21.79 | 21.73 | 0.00 |
| 3 | 3 | 8472 | 1859 | 6613 | 15.07 | 12.80 | 0.16 |
| 4 | 4 | 4696 | 1161 | 3535 | 9.41 | 6.84 | 0.32 |
| 5 | 5 | 2601 | 718 | 1883 | 5.82 | 3.64 | 0.47 |
| 6 | 6 | 1406 | 394 | 1012 | 3.19 | 1.96 | 0.49 |
| 7 | 7 | 772 | 230 | 542 | 1.86 | 1.05 | 0.57 |
| 8 | 8 | 430 | 128 | 302 | 1.04 | 0.58 | 0.58 |
| 22 | 22 | 6 | 3 | 3 | 0.02 | 0.01 | 0.69 |
| 12 | 12 | 63 | 20 | 43 | 0.16 | 0.08 | 0.69 |
| 9 | 9 | 232 | 78 | 154 | 0.63 | 0.30 | 0.74 |
| 13 | 13 | 55 | 20 | 35 | 0.16 | 0.07 | 0.83 |
| 11 | 11 | 112 | 39 | 73 | 0.32 | 0.14 | 0.83 |
| 10 | 10 | 147 | 54 | 93 | 0.44 | 0.18 | 0.89 |
| 15 | 15 | 19 | 8 | 11 | 0.06 | 0.02 | 1.10 |
| 18 | 18 | 9 | 4 | 5 | 0.03 | 0.01 | 1.10 |
| 14 | 14 | 49 | 22 | 27 | 0.18 | 0.05 | 1.28 |
| 16 | 16 | 13 | 7 | 6 | 0.06 | 0.01 | 1.79 |
| 17 | 17 | 12 | 8 | 4 | 0.06 | 0.01 | 1.79 |
| 20 | 20 | 3 | 1 | 2 | 0.01 | 0.00 | inf |
| 21 | 21 | 3 | 2 | 1 | 0.02 | 0.00 | inf |
| 24 | 24 | 4 | 3 | 1 | 0.02 | 0.00 | inf |
| 25 | 25 | 2 | 1 | 1 | 0.01 | 0.00 | inf |
| 28 | 30 | 1 | 1 | 0 | 0.01 | 0.00 | inf |
| 29 | 35 | 1 | 1 | 0 | 0.01 | 0.00 | inf |
| 23 | 23 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 26 | 26 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 27 | 28 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
#Bin the train data
train_df['TD010_bin'] = pd.qcut(train_df['TD010'], 5, duplicates='drop').values.add_categories("NoData")
train_df['TD010_bin'] = train_df['TD010_bin'].fillna("NoData").astype(str)
train_df['TD010_bin'].value_counts(dropna=False)
(-0.001, 1.0]    30969
(1.0, 2.0]       13916
(3.0, 35.0]      10643
(2.0, 3.0]        8472
Name: TD010_bin, dtype: int64
k = WOE('TD010_bin')
k
| TD010_bin | Count | Good | Bad | Good % | Bad % | TD010_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | (-0.001, 1.0] | 30969 | 4887 | 26082 | 39.61 | 50.49 | -0.24 |
| 1 | (1.0, 2.0] | 13916 | 2689 | 11227 | 21.79 | 21.73 | 0.00 |
| 2 | (2.0, 3.0] | 8472 | 1859 | 6613 | 15.07 | 12.80 | 0.16 |
| 3 | (3.0, 35.0] | 10643 | 2903 | 7740 | 23.53 | 14.98 | 0.45 |
#Append the WOE value of each category back to the original train data
train_df_WOE_TD010 = pd.merge(train_df,k[['TD010_bin','TD010_bin_WOE']],
left_on='TD010_bin',
right_on='TD010_bin',how='left')
train_df_WOE_TD010.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD010 | TD014 | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD006_bin | TD009_bin | TD010_bin | TD010_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 5 | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | 0.45 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | -0.24 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 2 | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | (5.0, 8.0] | (1.0, 2.0] | 0.00 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 3 | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (2.0, 3.0] | 0.16 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 0 | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | -0.24 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 4 | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | 0.45 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 3 | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | (-0.001, 1.0] | (5.0, 8.0] | (2.0, 3.0] | 0.16 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 1 | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | (-0.001, 1.0] | (5.0, 8.0] | (-0.001, 1.0] | -0.24 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 1 | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | (8.0, 46.0] | (-0.001, 1.0] | -0.24 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 3 | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | (4.0, 5.0] | (2.0, 3.0] | 0.16 |
10 rows × 25 columns
# Define the bin labels in ascending interval order (qcut assigns labels from the lowest bin upward)
bin_labels = ["(-0.001, 1.0]", "(1.0, 2.0]", "(2.0, 3.0]", "(3.0, 35.0]"]
# Bin the test data, keeping the integer bin codes (labels=False)
test_df['TD010_bin_labels'] = pd.qcut(test_df['TD010'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD010_bin'] = pd.qcut(test_df['TD010'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD010_bin'] = test_df['TD010_bin'].fillna("NoData")
# Print the value counts
test_df['TD010_bin'].value_counts(dropna=False)
(-0.001, 1.0]    7789
(1.0, 2.0]       3472
(2.0, 3.0]       2665
(3.0, 35.0]      2074
Name: TD010_bin, dtype: int64
#Append the WOE table to the test data
test_df_WOE_TD010 = pd.merge(test_df,k[['TD010_bin','TD010_bin_WOE']],
left_on='TD010_bin',
right_on='TD010_bin',how='left')
test_df_WOE_TD010.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | CR015_bin | TD001_bin_labels | TD001_bin | TD006_bin_labels | TD006_bin | TD009_bin_labels | TD009_bin | TD010_bin_labels | TD010_bin | TD010_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 0 | (2.0, 4.0] | 0 | (-0.001, 1.0] | -0.24 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0 | (-0.001, 1.0] | -0.24 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | (5.0, 6.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | -0.24 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | (5.0, 6.0] | 0 | (-0.001, 1.0] | 2 | (2.0, 21.0] | 0 | (2.0, 4.0] | 2 | (3.0, 35.0] | 0.45 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | (5.0, 6.0] | 3 | (2.0, 3.0] | 1 | (1.0, 2.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 0.16 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | (1.999, 4.0] | 3 | (2.0, 3.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 2 | (3.0, 35.0] | 0.45 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | (4.0, 5.0] | 3 | (2.0, 3.0] | 2 | (2.0, 21.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 0.16 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | -0.24 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | (5.0, 6.0] | 1 | (1.0, 2.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 0.00 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | (5.0, 6.0] | 2 | (3.0, 20.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 0.00 |
10 rows × 32 columns
nan_check = train_df_WOE_TD010['TD010_bin_WOE'].isna()
nan_values = train_df_WOE_TD010['TD010_bin_WOE'][nan_check]
nan_values
Series([], Name: TD010_bin_WOE, dtype: float64)
nan_check = test_df_WOE_TD010['TD010_bin_WOE'].isna()
nan_values = test_df_WOE_TD010['TD010_bin_WOE'][nan_check]
nan_values
Series([], Name: TD010_bin_WOE, dtype: float64)
k = WOE('TD014')
k
| TD014 | Count | Good | Bad | Good % | Bad % | TD014_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 9486 | 1357 | 8129 | 11.00 | 15.73 | -0.36 |
| 1 | 1 | 15573 | 2400 | 13173 | 19.45 | 25.50 | -0.27 |
| 2 | 2 | 13366 | 2407 | 10959 | 19.51 | 21.21 | -0.08 |
| 3 | 3 | 9156 | 1856 | 7300 | 15.04 | 14.13 | 0.06 |
| 4 | 4 | 5967 | 1414 | 4553 | 11.46 | 8.81 | 0.26 |
| 18 | 18 | 21 | 5 | 16 | 0.04 | 0.03 | 0.29 |
| 5 | 5 | 3755 | 941 | 2814 | 7.63 | 5.45 | 0.34 |
| 6 | 6 | 2332 | 625 | 1707 | 5.07 | 3.30 | 0.43 |
| 7 | 7 | 1465 | 422 | 1043 | 3.42 | 2.02 | 0.53 |
| 10 | 10 | 408 | 117 | 291 | 0.95 | 0.56 | 0.53 |
| 13 | 13 | 120 | 35 | 85 | 0.28 | 0.16 | 0.56 |
| 9 | 9 | 630 | 188 | 442 | 1.52 | 0.86 | 0.57 |
| 8 | 8 | 938 | 288 | 650 | 2.33 | 1.26 | 0.61 |
| 11 | 11 | 274 | 88 | 186 | 0.71 | 0.36 | 0.68 |
| 12 | 12 | 182 | 60 | 122 | 0.49 | 0.24 | 0.71 |
| 14 | 14 | 103 | 36 | 67 | 0.29 | 0.13 | 0.80 |
| 19 | 19 | 16 | 6 | 10 | 0.05 | 0.02 | 0.92 |
| 20 | 20 | 17 | 6 | 11 | 0.05 | 0.02 | 0.92 |
| 16 | 16 | 50 | 20 | 30 | 0.16 | 0.06 | 0.98 |
| 21 | 21 | 11 | 4 | 7 | 0.03 | 0.01 | 1.10 |
| 15 | 15 | 60 | 26 | 34 | 0.21 | 0.07 | 1.10 |
| 17 | 17 | 35 | 18 | 17 | 0.15 | 0.03 | 1.61 |
| 31 | 36 | 1 | 1 | 0 | 0.01 | 0.00 | inf |
| 22 | 22 | 9 | 7 | 2 | 0.06 | 0.00 | inf |
| 23 | 23 | 4 | 2 | 2 | 0.02 | 0.00 | inf |
| 24 | 24 | 5 | 3 | 2 | 0.02 | 0.00 | inf |
| 25 | 25 | 4 | 2 | 2 | 0.02 | 0.00 | inf |
| 26 | 26 | 2 | 1 | 1 | 0.01 | 0.00 | inf |
| 27 | 28 | 4 | 2 | 2 | 0.02 | 0.00 | inf |
| 32 | 43 | 1 | 1 | 0 | 0.01 | 0.00 | inf |
| 28 | 30 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 29 | 31 | 2 | 0 | 2 | 0.00 | 0.00 | NaN |
| 30 | 32 | 2 | 0 | 2 | 0.00 | 0.00 | NaN |
#Bin the train data
train_df['TD014_bin'] = pd.qcut(train_df['TD014'], 5, duplicates='drop').values.add_categories("NoData")
train_df['TD014_bin'] = train_df['TD014_bin'].fillna("NoData").astype(str)
train_df['TD014_bin'].value_counts(dropna=False)
(-0.001, 1.0]    25059
(2.0, 4.0]       15123
(1.0, 2.0]       13366
(4.0, 43.0]      10452
Name: TD014_bin, dtype: int64
k = WOE('TD014_bin')
k
| TD014_bin | Count | Good | Bad | Good % | Bad % | TD014_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 0 | (-0.001, 1.0] | 25059 | 3757 | 21302 | 30.45 | 41.23 | -0.30 |
| 1 | (1.0, 2.0] | 13366 | 2407 | 10959 | 19.51 | 21.21 | -0.08 |
| 2 | (2.0, 4.0] | 15123 | 3270 | 11853 | 26.50 | 22.94 | 0.14 |
| 3 | (4.0, 43.0] | 10452 | 2904 | 7548 | 23.54 | 14.61 | 0.48 |
#Append the WOE value of each category back to the original train data
train_df_WOE_TD014 = pd.merge(train_df,k[['TD014_bin','TD014_bin_WOE']],
left_on='TD014_bin',
right_on='TD014_bin',how='left')
train_df_WOE_TD014.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD014 | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD006_bin | TD009_bin | TD010_bin | TD014_bin | TD014_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (4.0, 43.0] | 0.48 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | -0.30 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | (5.0, 8.0] | (1.0, 2.0] | (1.0, 2.0] | -0.08 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (2.0, 3.0] | (2.0, 4.0] | 0.14 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | -0.30 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | 4 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (2.0, 4.0] | 0.14 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | 3 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | (-0.001, 1.0] | (5.0, 8.0] | (2.0, 3.0] | (2.0, 4.0] | 0.14 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | (-0.001, 1.0] | (5.0, 8.0] | (-0.001, 1.0] | (-0.001, 1.0] | -0.30 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | (8.0, 46.0] | (-0.001, 1.0] | (1.0, 2.0] | -0.08 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | 3 | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | (4.0, 5.0] | (2.0, 3.0] | (2.0, 4.0] | 0.14 |
10 rows × 26 columns
nan_check = train_df_WOE_TD014['TD014_bin_WOE'].isna()
nan_values = train_df_WOE_TD014['TD014_bin_WOE'][nan_check]
nan_values
Series([], Name: TD014_bin_WOE, dtype: float64)
# Define the bin labels in ascending interval order (qcut assigns labels from the lowest bin upward)
bin_labels = ["(-0.001, 1.0]", "(1.0, 2.0]", "(2.0, 4.0]", "(4.0, 43.0]"]
# Bin the test data, keeping the integer bin codes (labels=False)
test_df['TD014_bin_labels'] = pd.qcut(test_df['TD014'], 5, duplicates='drop', labels=False)
# Map the bin labels to the original binning ranges
test_df['TD014_bin'] = pd.qcut(test_df['TD014'], 5, duplicates='drop', labels=bin_labels)
# Replace NaN values with "NoData"
test_df['TD014_bin'] = test_df['TD014_bin'].fillna("NoData")
# Print the value counts
test_df['TD014_bin'].value_counts(dropna=False)
(-0.001, 1.0]    6323
(1.0, 2.0]       3809
(2.0, 4.0]       3279
(4.0, 43.0]      2589
Name: TD014_bin, dtype: int64
#Append the WOE table to the test data
test_df_WOE_TD014 = pd.merge(test_df,k[['TD014_bin','TD014_bin_WOE']],
left_on='TD014_bin',
right_on='TD014_bin',how='left')
test_df_WOE_TD014.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD001_bin | TD006_bin_labels | TD006_bin | TD009_bin_labels | TD009_bin | TD010_bin_labels | TD010_bin | TD014_bin_labels | TD014_bin | TD014_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | (1.0, 2.0] | 0 | (-0.001, 1.0] | 0 | (2.0, 4.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | -0.30 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | (1.0, 2.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | 0.14 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | (-0.001, 1.0] | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | -0.30 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | (-0.001, 1.0] | 2 | (2.0, 21.0] | 0 | (2.0, 4.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | -0.08 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | (2.0, 3.0] | 1 | (1.0, 2.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | 0.48 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | (2.0, 3.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | -0.08 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | (2.0, 3.0] | 2 | (2.0, 21.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | 0.48 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | (1.0, 2.0] | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | 0.14 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | (1.0, 2.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 3 | (4.0, 43.0] | 0.48 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | (3.0, 20.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 2 | (1.0, 2.0] | -0.08 |
10 rows × 34 columns
nan_check = test_df_WOE_TD014['TD014_bin_WOE'].isna()
nan_values = test_df_WOE_TD014['TD014_bin_WOE'][nan_check]
nan_values
Series([], Name: TD014_bin_WOE, dtype: float64)
k = WOE('PA022')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
| PA022 | Count | Good | Bad | Good % | Bad % | PA022_WOE | |
|---|---|---|---|---|---|---|---|
| 125 | 123.0 | 4 | 0 | 4 | 0.00 | 0.01 | -inf |
| 123 | 121.0 | 10 | 1 | 9 | 0.01 | 0.02 | -0.69 |
| 53 | 51.0 | 177 | 23 | 154 | 0.19 | 0.30 | -0.46 |
| 0 | -99.0 | 1196 | 179 | 1017 | 1.45 | 1.97 | -0.31 |
| 74 | 72.0 | 195 | 29 | 166 | 0.24 | 0.32 | -0.29 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 155 | 426.0 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 157 | 437.0 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 158 | 440.0 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 159 | 441.0 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
| 161 | 448.0 | 1 | 0 | 1 | 0.00 | 0.00 | NaN |
163 rows × 7 columns
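The RuntimeWarning and the `-inf` rows above come from raw levels whose Good count is zero, so the WOE formula takes `ln(0)`. Binning (below) mostly avoids this, but when a bin still ends up one-sided, an additive smoothing term keeps the WOE finite. A minimal sketch with toy counts (not the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy per-level counts; level 'b' has zero Goods, which is what produces the warning
tab = pd.DataFrame({'Good': [4, 0, 10], 'Bad': [6, 5, 9]}, index=['a', 'b', 'c'])

# Raw WOE = ln(Good% / Bad%) -- the zero numerator yields -inf for 'b'
raw_woe = np.log((tab['Good'] / tab['Good'].sum()) / (tab['Bad'] / tab['Bad'].sum()))

# Additive (Laplace-style) smoothing keeps every WOE finite
eps = 0.5
smoothed = np.log(((tab['Good'] + eps) / (tab['Good'] + eps).sum())
                  / ((tab['Bad'] + eps) / (tab['Bad'] + eps).sum()))
```

The choice of `eps` is a judgment call; 0.5 per cell is a common default.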
# Bin the train data
# Convert 'PA022' to numeric; errors='coerce' turns any non-numeric value (including 'NoData') into NaN
train_df['PA022'] = pd.to_numeric(train_df['PA022'], errors='coerce')
# Quantile-bin into 5 groups and add an explicit "NoData" category for missing values
train_df['PA022_bin'] = pd.qcut(train_df['PA022'], 5, duplicates='drop').values.add_categories("NoData")
train_df['PA022_bin'] = train_df['PA022_bin'].fillna("NoData").astype(str)
train_df['PA022_bin'].value_counts(dropna=False)
(-99.001, -1.0]    41766
(59.0, 448.0]      12644
(-1.0, 59.0]        9278
NoData               312
Name: PA022_bin, dtype: int64
train_df
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD014 | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD006_bin | TD009_bin | TD010_bin | TD014_bin | PA022_bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3822 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | 5 | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (4.0, 43.0] | (-99.001, -1.0] |
| 35562 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | 1 | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (59.0, 448.0] |
| 4883 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | 2 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | (5.0, 8.0] | (1.0, 2.0] | (1.0, 2.0] | (-99.001, -1.0] |
| 71170 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | 3 | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] |
| 25665 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | 0 | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6265 | 6266 | 0 | 25 | 3 | 3 | 12000 | 5 | 3 | -1.0 | -1.0 | ... | 2 | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | (4.0, 5.0] | (-0.001, 1.0] | (1.0, 2.0] | (-99.001, -1.0] |
| 54886 | 54887 | 0 | 31 | 3 | 4 | 60300 | 6 | 5 | 69.0 | -1.0 | ... | 1 | (0.999, 3.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (4.0, 5.0] | (-0.001, 1.0] | (-0.001, 1.0] | (59.0, 448.0] |
| 76820 | 76821 | 0 | 28 | 3 | 2 | 45167 | 5 | 3 | -1.0 | -1.0 | ... | 3 | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | (8.0, 46.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] |
| 860 | 861 | 1 | 28 | 1 | 5 | 59111 | 6 | 11 | -1.0 | -1.0 | ... | 2 | (0.999, 3.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (-0.001, 1.0] | (1.0, 2.0] | (5.0, 8.0] | (1.0, 2.0] | (1.0, 2.0] | (-99.001, -1.0] |
| 15795 | 15796 | 0 | 27 | 1 | 4 | 2878 | 5 | 2 | -1.0 | -1.0 | ... | 1 | (0.999, 3.0] | (2500.0, 11484.4] | (4.0, 5.0] | (-0.001, 1.0] | (-0.001, 1.0] | (2.0, 4.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] |
64000 rows × 26 columns
k = WOE('PA022_bin')
k
| PA022_bin | Count | Good | Bad | Good % | Bad % | PA022_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 1 | (-99.001, -1.0] | 41766 | 7093 | 34673 | 57.49 | 67.12 | -0.15 |
| 0 | (-1.0, 59.0] | 9278 | 2121 | 7157 | 17.19 | 13.85 | 0.22 |
| 2 | (59.0, 448.0] | 12644 | 3045 | 9599 | 24.68 | 18.58 | 0.28 |
| 3 | NoData | 312 | 79 | 233 | 0.64 | 0.45 | 0.35 |
train_df_WOE_PA022 = pd.merge(train_df, k[['PA022_bin', 'PA022_bin_WOE']],
left_on='PA022_bin',
right_on='PA022_bin', how='left')
train_df_WOE_PA022.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | AP003_bin | CR009_bin | CR015_bin | TD001_bin | TD006_bin | TD009_bin | TD010_bin | TD014_bin | PA022_bin | PA022_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | (3.0, 6.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (4.0, 43.0] | (-99.001, -1.0] | -0.15 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (59.0, 448.0] | 0.28 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | (5.0, 8.0] | (1.0, 2.0] | (1.0, 2.0] | (-99.001, -1.0] | -0.15 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | (0.999, 3.0] | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] | -0.15 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | (3.0, 6.0] | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] | -0.15 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | (0.999, 3.0] | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (2.0, 4.0] | (-99.001, -1.0] | -0.15 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | (0.999, 3.0] | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | (-0.001, 1.0] | (5.0, 8.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] | -0.15 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | (0.999, 3.0] | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | (-0.001, 1.0] | (5.0, 8.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] | -0.15 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | (0.999, 3.0] | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | (8.0, 46.0] | (-0.001, 1.0] | (1.0, 2.0] | (-1.0, 59.0] | 0.22 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | (0.999, 3.0] | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | (4.0, 5.0] | (2.0, 3.0] | (2.0, 4.0] | (59.0, 448.0] | 0.28 |
10 rows × 27 columns
nan_check= train_df_WOE_PA022['PA022_bin_WOE'].isna()
nan_values = train_df_WOE_PA022['PA022_bin_WOE'][nan_check]
nan_values
Series([], Name: PA022_bin_WOE, dtype: float64)
test_df['PA022'] = pd.to_numeric(test_df['PA022'], errors='coerce')
test_df['PA022_bin'] = pd.qcut(test_df['PA022'],5,duplicates='drop').values.add_categories("NoData")
test_df['PA022_bin'] = test_df['PA022_bin'].fillna("NoData").astype(str)
test_df['PA022_bin'].value_counts(dropna=False)
(-99.001, -1.0]    10407
(57.0, 434.0]       3147
(-1.0, 57.0]        2377
NoData                69
Name: PA022_bin, dtype: int64
test_df.head()
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD001_bin | TD006_bin_labels | TD006_bin | TD009_bin_labels | TD009_bin | TD010_bin_labels | TD010_bin | TD014_bin_labels | TD014_bin | PA022_bin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47044 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | (1.0, 2.0] | 0 | (-0.001, 1.0] | 0 | (2.0, 4.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 57.0] |
| 44295 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | (1.0, 2.0] | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | (-99.001, -1.0] |
| 74783 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | (-0.001, 1.0] | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 57.0] |
| 70975 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | (-0.001, 1.0] | 2 | (2.0, 21.0] | 0 | (2.0, 4.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | (57.0, 434.0] |
| 46645 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | (2.0, 3.0] | 1 | (1.0, 2.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | (57.0, 434.0] |
5 rows × 34 columns
# Rename the test bin labels to the matching train bin labels so the WOE merge below finds them
test_df['PA022_bin'] = test_df['PA022_bin'].replace({"(-1.0, 57.0]": '(-1.0, 59.0]', '(57.0, 434.0]': '(59.0, 448.0]'})
test_df['PA022_bin'].value_counts(dropna=False)
(-99.001, -1.0]    10407
(59.0, 448.0]       3147
(-1.0, 59.0]        2377
NoData                69
Name: PA022_bin, dtype: int64
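Renaming the test bin labels works, but it assumes the test quantiles line up with the train quantiles. A more robust pattern is to derive the quantile edges on the training column once and cut both sets with those edges, so the labels match by construction. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical values standing in for train_df/test_df['PA022']
train_x = pd.Series([-99, -1, -1, 10, 30, 60, 120, 448, 5, 200])
test_x = pd.Series([-99, 2, 57, 434, 500])

# Derive quantile edges on the TRAIN column once...
_, edges = pd.qcut(train_x, 5, retbins=True, duplicates='drop')
# ...then cut BOTH sets with those edges so the interval labels are identical
train_bin = pd.cut(train_x, bins=edges, include_lowest=True)
test_bin = pd.cut(test_x, bins=edges, include_lowest=True)
```

Test values outside the train range (500 here) come back as NaN and can be filled with "NoData", just as the notebook does.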
test_df_WOE_PA022 = pd.merge(test_df, k[['PA022_bin', 'PA022_bin_WOE']],
left_on='PA022_bin',
right_on='PA022_bin', how='left')
test_df_WOE_PA022.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD006_bin_labels | TD006_bin | TD009_bin_labels | TD009_bin | TD010_bin_labels | TD010_bin | TD014_bin_labels | TD014_bin | PA022_bin | PA022_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 0 | (-0.001, 1.0] | 0 | (2.0, 4.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 59.0] | 0.22 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | (-99.001, -1.0] | -0.15 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 59.0] | 0.22 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 2 | (2.0, 21.0] | 0 | (2.0, 4.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | (59.0, 448.0] | 0.28 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 1 | (1.0, 2.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | (59.0, 448.0] | 0.28 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | (-99.001, -1.0] | -0.15 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | 2 | (2.0, 21.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | (59.0, 448.0] | 0.28 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | 0 | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | (59.0, 448.0] | 0.28 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 3 | (4.0, 43.0] | (-99.001, -1.0] | -0.15 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | 0 | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 2 | (1.0, 2.0] | (-1.0, 59.0] | 0.22 |
10 rows × 35 columns
nan_check = test_df_WOE_PA022['PA022_bin_WOE'].isna()
nan_values = test_df_WOE_PA022['PA022_bin_WOE'][nan_check]
nan_values
Series([], Name: PA022_bin_WOE, dtype: float64)
# Bin the train data
# Convert 'PA023' to numeric; errors='coerce' turns any non-numeric value (including 'NoData') into NaN
train_df['PA023'] = pd.to_numeric(train_df['PA023'], errors='coerce')
# Quantile-bin into 5 groups and add an explicit "NoData" category for missing values
train_df['PA023_bin'] = pd.qcut(train_df['PA023'], 5, duplicates='drop').values.add_categories("NoData")
train_df['PA023_bin'] = train_df['PA023_bin'].fillna("NoData").astype(str)
train_df['PA023_bin'].value_counts(dropna=False)
(-99.001, -1.0]    46059
(41.0, 448.0]      12715
(-1.0, 41.0]        4914
NoData               312
Name: PA023_bin, dtype: int64
k = WOE('PA023_bin')
k
| PA023_bin | Count | Good | Bad | Good % | Bad % | PA023_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 1 | (-99.001, -1.0] | 46059 | 7997 | 38062 | 64.82 | 73.68 | -0.13 |
| 0 | (-1.0, 41.0] | 4914 | 1165 | 3749 | 9.44 | 7.26 | 0.26 |
| 2 | (41.0, 448.0] | 12715 | 3097 | 9618 | 25.10 | 18.62 | 0.30 |
| 3 | NoData | 312 | 79 | 233 | 0.64 | 0.45 | 0.35 |
train_df_WOE_PA023 = pd.merge(train_df, k[['PA023_bin','PA023_bin_WOE']],
left_on='PA023_bin',
right_on='PA023_bin', how='left')
train_df_WOE_PA023.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | CR009_bin | CR015_bin | TD001_bin | TD006_bin | TD009_bin | TD010_bin | TD014_bin | PA022_bin | PA023_bin | PA023_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (4.0, 43.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | (-0.001, 2500.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (59.0, 448.0] | (41.0, 448.0] | 0.30 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | (24221.8, 50000.0] | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | (5.0, 8.0] | (1.0, 2.0] | (1.0, 2.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | (11484.4, 24221.8] | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | (50000.0, 1420300.0] | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | (24221.8, 50000.0] | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | (-0.001, 2500.0] | (5.0, 6.0] | (2.0, 3.0] | (-0.001, 1.0] | (5.0, 8.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | (-0.001, 2500.0] | (1.999, 4.0] | (3.0, 20.0] | (-0.001, 1.0] | (5.0, 8.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | (11484.4, 24221.8] | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | (8.0, 46.0] | (-0.001, 1.0] | (1.0, 2.0] | (-1.0, 59.0] | (-1.0, 41.0] | 0.26 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | (50000.0, 1420300.0] | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | (4.0, 5.0] | (2.0, 3.0] | (2.0, 4.0] | (59.0, 448.0] | (-99.001, -1.0] | -0.13 |
10 rows × 28 columns
nan_check= train_df_WOE_PA023['PA023_bin_WOE'].isna()
nan_values = train_df_WOE_PA023['PA023_bin_WOE'][nan_check]
nan_values
Series([], Name: PA023_bin_WOE, dtype: float64)
test_df['PA023'] = pd.to_numeric(test_df['PA023'], errors='coerce')
test_df['PA023_bin'] = pd.qcut(test_df['PA023'],5,duplicates='drop').values.add_categories("NoData")
test_df['PA023_bin'] = test_df['PA023_bin'].fillna("NoData").astype(str)
test_df['PA023_bin'].value_counts(dropna=False)
(-99.001, -1.0]    11479
(39.0, 434.0]       3174
(-1.0, 39.0]        1278
NoData                69
Name: PA023_bin, dtype: int64
# Rename the test bin labels to the matching train bin labels so the WOE merge below finds them
test_df['PA023_bin'] = test_df['PA023_bin'].replace({"(-1.0, 39.0]": "(-1.0, 41.0]","(39.0, 434.0]": "(41.0, 448.0]"})
test_df['PA023_bin'].value_counts(dropna=False)
(-99.001, -1.0]    11479
(41.0, 448.0]       3174
(-1.0, 41.0]        1278
NoData                69
Name: PA023_bin, dtype: int64
test_df_WOE_PA023 = pd.merge(test_df, k[['PA023_bin', 'PA023_bin_WOE']],
left_on='PA023_bin',
right_on='PA023_bin', how='left')
test_df_WOE_PA023.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD006_bin | TD009_bin_labels | TD009_bin | TD010_bin_labels | TD010_bin | TD014_bin_labels | TD014_bin | PA022_bin | PA023_bin | PA023_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | (-0.001, 1.0] | 0 | (2.0, 4.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 59.0] | (-1.0, 41.0] | 0.26 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | (-0.001, 1.0] | 3 | (8.0, 46.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 59.0] | (41.0, 448.0] | 0.30 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | (2.0, 21.0] | 0 | (2.0, 4.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | (59.0, 448.0] | (41.0, 448.0] | 0.30 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | (1.0, 2.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | (59.0, 448.0] | (41.0, 448.0] | 0.30 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | (-0.001, 1.0] | 3 | (8.0, 46.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | (2.0, 21.0] | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | (59.0, 448.0] | (41.0, 448.0] | 0.30 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | (-0.001, 1.0] | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | (59.0, 448.0] | (41.0, 448.0] | 0.30 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 3 | (4.0, 43.0] | (-99.001, -1.0] | (-99.001, -1.0] | -0.13 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | (-0.001, 1.0] | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 2 | (1.0, 2.0] | (-1.0, 59.0] | (-1.0, 41.0] | 0.26 |
10 rows × 36 columns
nan_check = test_df_WOE_PA023['PA023_bin_WOE'].isna()
nan_values = test_df_WOE_PA023['PA023_bin_WOE'][nan_check]
nan_values
Series([], Name: PA023_bin_WOE, dtype: float64)
k = WOE('PA029')
k
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log
  result = getattr(ufunc, method)(*inputs, **kwargs)
| PA029 | Count | Good | Bad | Good % | Bad % | PA029_WOE | |
|---|---|---|---|---|---|---|---|
| 2516 | 142.5 | 6 | 0 | 6 | 0.0 | 0.01 | -inf |
| 778 | 45.25 | 3 | 0 | 3 | 0.0 | 0.01 | -inf |
| 1587 | 77.666667 | 6 | 0 | 6 | 0.0 | 0.01 | -inf |
| 1614 | 79.2 | 3 | 0 | 3 | 0.0 | 0.01 | -inf |
| 2988 | 221.5 | 5 | 0 | 5 | 0.0 | 0.01 | -inf |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3583 | 1462.0 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 3584 | 1614.0 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 3585 | 1757.0 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 3586 | 1919.0 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
| 3588 | 2872.0 | 1 | 0 | 1 | 0.0 | 0.00 | NaN |
3590 rows × 7 columns
# Bin the train data
# Convert 'PA029' to numeric; errors='coerce' turns any non-numeric value (including 'NoData') into NaN
train_df['PA029'] = pd.to_numeric(train_df['PA029'], errors='coerce')
# Quantile-bin into 5 groups and add an explicit "NoData" category for missing values
train_df['PA029_bin'] = pd.qcut(train_df['PA029'], 5, duplicates='drop').values.add_categories("NoData")
train_df['PA029_bin'] = train_df['PA029_bin'].fillna("NoData").astype(str)
train_df['PA029_bin'].value_counts(dropna=False)
(-99.001, -98.0]    43718
(40.0, 2872.0]      12674
(-98.0, 40.0]        7296
NoData                312
Name: PA029_bin, dtype: int64
k = WOE('PA029_bin')
k
| PA029_bin | Count | Good | Bad | Good % | Bad % | PA029_bin_WOE | |
|---|---|---|---|---|---|---|---|
| 1 | (-99.001, -98.0] | 43718 | 7545 | 36173 | 61.15 | 70.02 | -0.14 |
| 0 | (-98.0, 40.0] | 7296 | 1493 | 5803 | 12.10 | 11.23 | 0.07 |
| 3 | NoData | 312 | 79 | 233 | 0.64 | 0.45 | 0.35 |
| 2 | (40.0, 2872.0] | 12674 | 3221 | 9453 | 26.11 | 18.30 | 0.36 |
train_df_WOE_PA029 = pd.merge(train_df, k[['PA029_bin','PA029_bin_WOE']],
left_on='PA029_bin',
right_on='PA029_bin', how='left')
train_df_WOE_PA029.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | CR015_bin | TD001_bin | TD006_bin | TD009_bin | TD010_bin | TD014_bin | PA022_bin | PA023_bin | PA029_bin | PA029_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | 29 | 4 | 2 | 37635 | 5 | 5 | -1.0 | -1.0 | ... | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (4.0, 43.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 1 | 35563 | 1 | 47 | 1 | 2 | 0 | 6 | 12 | 87.0 | 87.0 | ... | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (59.0, 448.0] | (41.0, 448.0] | (-98.0, 40.0] | 0.07 |
| 2 | 4884 | 0 | 31 | 1 | 5 | 47506 | 5 | 12 | -1.0 | -1.0 | ... | (4.0, 5.0] | (1.0, 2.0] | (-0.001, 1.0] | (5.0, 8.0] | (1.0, 2.0] | (1.0, 2.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 3 | 71171 | 0 | 29 | 3 | 4 | 22037 | 6 | 5 | -1.0 | -1.0 | ... | (5.0, 6.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 4 | 25666 | 0 | 35 | 4 | 3 | 67400 | 6 | 7 | -1.0 | -1.0 | ... | (5.0, 6.0] | (1.0, 2.0] | (-0.001, 1.0] | (-0.001, 2.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 5 | 8007 | 0 | 30 | 3 | 2 | 26917 | 5 | 4 | -1.0 | -1.0 | ... | (4.0, 5.0] | (3.0, 20.0] | (2.0, 21.0] | (8.0, 46.0] | (3.0, 35.0] | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 6 | 62227 | 0 | 35 | 1 | 5 | 0 | 6 | 3 | -1.0 | -1.0 | ... | (5.0, 6.0] | (2.0, 3.0] | (-0.001, 1.0] | (5.0, 8.0] | (2.0, 3.0] | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 7 | 12634 | 0 | 25 | 1 | 5 | 0 | 3 | 5 | -1.0 | -1.0 | ... | (1.999, 4.0] | (3.0, 20.0] | (-0.001, 1.0] | (5.0, 8.0] | (-0.001, 1.0] | (-0.001, 1.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 8 | 56100 | 1 | 26 | 3 | 5 | 20799 | 5 | 5 | 12.0 | 12.0 | ... | (4.0, 5.0] | (3.0, 20.0] | (-0.001, 1.0] | (8.0, 46.0] | (-0.001, 1.0] | (1.0, 2.0] | (-1.0, 59.0] | (-1.0, 41.0] | (40.0, 2872.0] | 0.36 |
| 9 | 33174 | 0 | 37 | 1 | 3 | 55000 | 5 | 7 | 69.0 | -1.0 | ... | (4.0, 5.0] | (1.0, 2.0] | (2.0, 21.0] | (4.0, 5.0] | (2.0, 3.0] | (2.0, 4.0] | (59.0, 448.0] | (-99.001, -1.0] | (40.0, 2872.0] | 0.36 |
10 rows × 29 columns
nan_check= train_df_WOE_PA029['PA029_bin_WOE'].isna()
nan_values = train_df_WOE_PA029['PA029_bin_WOE'][nan_check]
nan_values
Series([], Name: PA029_bin_WOE, dtype: float64)
test_df['PA029'] = pd.to_numeric(test_df['PA029'], errors='coerce')
test_df['PA029_bin'] = pd.qcut(test_df['PA029'],5,duplicates='drop').values.add_categories("NoData")
test_df['PA029_bin'] = test_df['PA029_bin'].fillna("NoData").astype(str)
test_df['PA029_bin'].value_counts(dropna=False)
(-99.001, -98.0]    10902
(40.2, 1767.75]      3186
(-98.0, 40.2]        1843
NoData                 69
Name: PA029_bin, dtype: int64
# Rename the test bin labels to the matching train bin labels so the WOE merge below finds them
test_df['PA029_bin'] = test_df['PA029_bin'].replace({"(40.2, 1767.75]": "(40.0, 2872.0]","(-98.0, 40.2]": "(-98.0, 40.0]"})
test_df['PA029_bin'].value_counts(dropna=False)
(-99.001, -98.0]    10902
(40.0, 2872.0]       3186
(-98.0, 40.0]        1843
NoData                 69
Name: PA029_bin, dtype: int64
test_df_WOE_PA029 = pd.merge(test_df, k[['PA029_bin', 'PA029_bin_WOE']],
left_on='PA029_bin',
right_on='PA029_bin', how='left')
test_df_WOE_PA029.head(10)
| id | loan_default | AP001 | AP003 | AP008 | CR009 | CR015 | CR019 | PA022 | PA023 | ... | TD009_bin_labels | TD009_bin | TD010_bin_labels | TD010_bin | TD014_bin_labels | TD014_bin | PA022_bin | PA023_bin | PA029_bin | PA029_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 30 | 3 | 3 | 10000 | 5 | 5 | 25.0 | 25.0 | ... | 0 | (2.0, 4.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 59.0] | (-1.0, 41.0] | (-99.001, -98.0] | -0.14 |
| 1 | 44296 | 0 | 33 | 3 | 5 | 27288 | 5 | 5 | -1.0 | -1.0 | ... | 3 | (8.0, 46.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 2 | 74784 | 0 | 29 | 4 | 5 | 33000 | 5 | 11 | 51.0 | 51.0 | ... | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 0 | (-0.001, 1.0] | (-1.0, 59.0] | (41.0, 448.0] | (-98.0, 40.0] | 0.07 |
| 3 | 70976 | 1 | 28 | 1 | 5 | 3000 | 5 | 3 | 85.0 | 85.0 | ... | 0 | (2.0, 4.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | (59.0, 448.0] | (41.0, 448.0] | (40.0, 2872.0] | 0.36 |
| 4 | 46646 | 0 | 27 | 1 | 3 | 48219 | 5 | 11 | 58.0 | 58.0 | ... | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | (59.0, 448.0] | (41.0, 448.0] | (40.0, 2872.0] | 0.36 |
| 5 | 8216 | 0 | 33 | 4 | 1 | 5000 | 6 | 11 | -1.0 | -1.0 | ... | 3 | (8.0, 46.0] | 2 | (3.0, 35.0] | 2 | (1.0, 2.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 6 | 65510 | 0 | 23 | 3 | 1 | 8100 | 2 | 3 | 75.0 | 75.0 | ... | 4 | (4.0, 5.0] | 3 | (2.0, 3.0] | 3 | (4.0, 43.0] | (59.0, 448.0] | (41.0, 448.0] | (40.0, 2872.0] | 0.36 |
| 7 | 62716 | 0 | 36 | 1 | 3 | 0 | 5 | 3 | 115.0 | 115.0 | ... | 1 | (-0.001, 2.0] | 0 | (-0.001, 1.0] | 1 | (2.0, 4.0] | (59.0, 448.0] | (41.0, 448.0] | (-98.0, 40.0] | 0.07 |
| 8 | 39860 | 0 | 21 | 3 | 3 | 17110 | 5 | 8 | -1.0 | -1.0 | ... | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 3 | (4.0, 43.0] | (-99.001, -1.0] | (-99.001, -1.0] | (-99.001, -98.0] | -0.14 |
| 9 | 58835 | 0 | 24 | 3 | 2 | 60877 | 5 | 10 | 52.0 | 23.0 | ... | 3 | (8.0, 46.0] | 1 | (1.0, 2.0] | 2 | (1.0, 2.0] | (-1.0, 59.0] | (-1.0, 41.0] | (40.0, 2872.0] | 0.36 |
10 rows × 37 columns
nan_check = test_df_WOE_PA029['PA029_bin_WOE'].isna()
nan_values = test_df_WOE_PA029['PA029_bin_WOE'][nan_check]
nan_values
Series([], Name: PA029_bin_WOE, dtype: float64)
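The same bin → WOE → merge → NaN-check sequence is repeated for every variable above. It could be wrapped in a single helper; the sketch below is a hypothetical consolidation (not code from this notebook), assuming a binary `loan_default` target and the notebook's convention of WOE = ln(Good% / Bad%) with "Good" counting the target-1 rows:

```python
import numpy as np
import pandas as pd

def bin_and_woe(train, test, col, target='loan_default', n_bins=5):
    """Quantile-bin `col` using TRAIN-derived edges, compute per-bin WOE on train,
    and merge the WOE column onto both frames. Hypothetical helper sketch."""
    train, test = train.copy(), test.copy()
    bin_col = f'{col}_bin'
    # Derive quantile edges on the train column only, then apply them to both sets
    _, edges = pd.qcut(train[col], n_bins, retbins=True, duplicates='drop')
    for df in (train, test):
        df[bin_col] = pd.cut(df[col], bins=edges, include_lowest=True).astype(str)
        df[bin_col] = df[bin_col].replace('nan', 'NoData')  # NaN and out-of-range values
    # Per-bin WOE on train: ln(Good% / Bad%), Good = target == 1
    tab = train.groupby(bin_col)[target].agg(Count='size', Good='sum')
    tab['Bad'] = tab['Count'] - tab['Good']
    woe = np.log((tab['Good'] / tab['Good'].sum()) / (tab['Bad'] / tab['Bad'].sum()))
    tab[f'{bin_col}_WOE'] = woe.round(2)
    train = train.merge(tab[[f'{bin_col}_WOE']], left_on=bin_col, right_index=True, how='left')
    test = test.merge(tab[[f'{bin_col}_WOE']], left_on=bin_col, right_index=True, how='left')
    return train, test, tab

# Tiny illustration with synthetic data (hypothetical, not the notebook's frames)
rng = np.random.default_rng(0)
tr = pd.DataFrame({'PA022': rng.integers(-99, 448, 500), 'loan_default': rng.integers(0, 2, 500)})
te = pd.DataFrame({'PA022': rng.integers(-99, 448, 200), 'loan_default': rng.integers(0, 2, 200)})
tr_woe, te_woe, woe_tab = bin_and_woe(tr, te, 'PA022')
```

Because the edges come from the training column, the bin labels agree across train and test, so no manual label renaming is needed.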
column_names = train_df.columns.tolist()
print(column_names)
['id', 'loan_default', 'AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006', 'TD009', 'TD010', 'TD014', 'AP003_bin', 'CR009_bin', 'CR015_bin', 'TD001_bin', 'TD006_bin', 'TD009_bin', 'TD010_bin', 'TD014_bin', 'PA022_bin', 'PA023_bin', 'PA029_bin']
column_names1 = test_df.columns.tolist()
print(column_names1)
['id', 'loan_default', 'AP001', 'AP003', 'AP008', 'CR009', 'CR015', 'CR019', 'PA022', 'PA023', 'PA029', 'TD001', 'TD005', 'TD006', 'TD009', 'TD010', 'TD014', 'AP003_bin_labels', 'AP003_bin', 'CR009_bin_labels', 'CR009_bin', 'CR015_bin_labels', 'CR015_bin', 'TD001_bin_labels', 'TD001_bin', 'TD006_bin_labels', 'TD006_bin', 'TD009_bin_labels', 'TD009_bin', 'TD010_bin_labels', 'TD010_bin', 'TD014_bin_labels', 'TD014_bin', 'PA022_bin', 'PA023_bin', 'PA029_bin']
# NOTE: dot assignment (df.target = ...) sets a Python attribute, not a DataFrame column,
# which is why pandas raises the UserWarning below
train_df.target = train_df.drop(columns=train_df.columns.difference(['loan_default']))
test_df.target = test_df.drop(columns=test_df.columns.difference(['loan_default']))
/var/folders/jl/pdyb2sq53l1_msbfhzzlrt6m0000gn/T/ipykernel_11511/245799215.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  train_df.target = train_df.drop(columns=train_df.columns.difference(['loan_default']))
/var/folders/jl/pdyb2sq53l1_msbfhzzlrt6m0000gn/T/ipykernel_11511/245799215.py:3: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  test_df.target = test_df.drop(columns=test_df.columns.difference(['loan_default']))
test_df.target
| loan_default | |
|---|---|
| 47044 | 0 |
| 44295 | 0 |
| 74783 | 0 |
| 70975 | 1 |
| 46645 | 0 |
| ... | ... |
| 67666 | 0 |
| 51146 | 0 |
| 42494 | 1 |
| 52517 | 0 |
| 7754 | 0 |
16000 rows × 1 columns
train_df.target
| loan_default | |
|---|---|
| 3822 | 0 |
| 35562 | 1 |
| 4883 | 0 |
| 71170 | 0 |
| 25665 | 0 |
| ... | ... |
| 6265 | 0 |
| 54886 | 0 |
| 76820 | 0 |
| 860 | 1 |
| 15795 | 0 |
64000 rows × 1 columns
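The UserWarning shown earlier is worth heeding: dot assignment sets a plain Python attribute on the DataFrame object rather than creating a column, so `.target` would not survive operations like `merge` or `drop`. A minimal demonstration with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'loan_default': [0, 1]})

# Dot assignment creates a Python attribute, NOT a DataFrame column
df.target = df[['loan_default']]
assert 'target' not in df.columns

# Explicit selection is the safer way to carry the target around
y = df['loan_default']
```

Keeping the target as its own variable (or as a real column via `df['target'] = ...`) avoids silently losing it later.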
# Keep only id, then left-join each variable's WOE column on id
train_df_WOE = train_df.drop(columns=train_df.columns.difference(['id']))
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_AP001[['id', 'AP001_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_AP003[['id', 'AP003_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_AP008[['id', 'AP008_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_CR009[['id', 'CR009_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_CR015[['id', 'CR015_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_CR019[['id', 'CR019_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_PA022[['id', 'PA022_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_PA023[['id', 'PA023_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_PA029[['id', 'PA029_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_TD001[['id', 'TD001_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_TD005[['id', 'TD005_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_TD006[['id', 'TD006_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_TD009[['id', 'TD009_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_TD010[['id', 'TD010_bin_WOE']], on='id', how='left')
train_df_WOE = pd.merge(train_df_WOE, train_df_WOE_TD014[['id', 'TD014_bin_WOE']], on='id', how='left')
train_df_WOE
| | id | AP001_WOE | AP003_bin_WOE | AP008_WOE | CR009_bin_WOE | CR015_bin_WOE | CR019_WOE | PA022_bin_WOE | PA023_bin_WOE | PA029_bin_WOE | TD001_bin_WOE | TD005_WOE | TD006_bin_WOE | TD009_bin_WOE | TD010_bin_WOE | TD014_bin_WOE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | -0.03 | -0.50 | -0.09 | 0.07 | 0.08 | 0.02 | -0.15 | -0.13 | -0.14 | 0.39 | 0.41 | 0.40 | 0.49 | 0.45 | 0.48 |
| 1 | 35563 | -0.04 | 0.07 | -0.09 | -0.09 | -0.27 | -0.22 | 0.28 | 0.30 | 0.07 | 0.02 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| 2 | 4884 | 0.01 | 0.07 | 0.11 | 0.07 | 0.08 | -0.22 | -0.15 | -0.13 | -0.14 | 0.02 | -0.03 | -0.14 | 0.17 | 0.00 | -0.08 |
| 3 | 71171 | -0.03 | 0.07 | 0.09 | 0.07 | -0.27 | 0.02 | -0.15 | -0.13 | -0.14 | 0.39 | 0.59 | 0.40 | 0.49 | 0.16 | 0.14 |
| 4 | 25666 | -0.09 | -0.50 | 0.02 | -0.14 | -0.27 | -0.01 | -0.15 | -0.13 | -0.14 | 0.02 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 63995 | 6266 | 0.04 | 0.07 | 0.02 | 0.07 | 0.08 | 0.12 | -0.15 | -0.13 | -0.14 | 0.39 | 0.04 | -0.14 | 0.04 | -0.24 | -0.08 |
| 63996 | 54887 | 0.01 | 0.07 | 0.09 | -0.14 | -0.27 | 0.02 | 0.28 | -0.13 | 0.07 | 0.02 | 0.04 | -0.14 | 0.04 | -0.24 | -0.30 |
| 63997 | 76821 | 0.04 | 0.07 | -0.09 | 0.07 | 0.08 | 0.12 | -0.15 | -0.13 | -0.14 | 0.02 | 0.69 | 0.40 | 0.49 | 0.16 | 0.14 |
| 63998 | 861 | 0.04 | 0.07 | 0.11 | -0.14 | -0.27 | -0.20 | -0.15 | -0.13 | -0.14 | -0.24 | -0.22 | 0.11 | 0.17 | 0.00 | -0.08 |
| 63999 | 15796 | 0.10 | 0.07 | 0.09 | 0.08 | 0.08 | 0.14 | -0.15 | -0.13 | -0.14 | -0.24 | -0.51 | -0.14 | -0.18 | -0.24 | -0.30 |
64000 rows × 16 columns
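For reference, each `*_WOE` column above is a lookup of the bin's weight of evidence, i.e. the log ratio of the bin's share of non-events to its share of events. A minimal sketch on a made-up two-category feature (the names `grade`/`default` are illustrative, not from the assignment data):

```python
import numpy as np
import pandas as pd

# Toy data: a hypothetical categorical feature and a binary default flag.
toy = pd.DataFrame({
    'grade':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'default': [0,   0,   1,   0,   1,   1],
})

stats = toy.groupby('grade')['default'].agg(events='sum', total='count')
stats['non_events'] = stats['total'] - stats['events']

# WOE = ln( % of non-events in bin / % of events in bin )
stats['WOE'] = np.log(
    (stats['non_events'] / stats['non_events'].sum())
    / (stats['events'] / stats['events'].sum())
)
```

Merging these per-bin WOE values back onto each row by `id` is exactly what the cells above do with the precomputed `train_df_WOE_*` frames.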
test_df_WOE = test_df[['id']].copy()
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_AP001[['id', 'AP001_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_AP003[['id', 'AP003_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_AP008[['id', 'AP008_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_CR009[['id', 'CR009_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_CR015[['id', 'CR015_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_CR019[['id', 'CR019_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_PA022[['id', 'PA022_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_PA023[['id', 'PA023_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_PA029[['id', 'PA029_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD001[['id', 'TD001_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD005[['id', 'TD005_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD006[['id', 'TD006_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD009[['id', 'TD009_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD010[['id', 'TD010_bin_WOE']], on='id', how='left')
test_df_WOE = pd.merge(test_df_WOE, test_df_WOE_TD014[['id', 'TD014_bin_WOE']], on='id', how='left')
test_df_WOE
| | id | AP001_WOE | AP003_bin_WOE | AP008_WOE | CR009_bin_WOE | CR015_bin_WOE | CR019_WOE | PA022_bin_WOE | PA023_bin_WOE | PA029_bin_WOE | TD001_bin_WOE | TD005_WOE | TD006_bin_WOE | TD009_bin_WOE | TD010_bin_WOE | TD014_bin_WOE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0.04 | 0.07 | 0.02 | -0.09 | -0.27 | 0.02 | 0.22 | 0.26 | -0.14 | 0.02 | -0.22 | -0.14 | -0.18 | -0.24 | -0.30 |
| 1 | 44296 | -0.04 | 0.07 | 0.11 | -0.14 | -0.27 | 0.02 | -0.15 | -0.13 | -0.14 | 0.02 | 0.04 | -0.14 | 0.49 | -0.24 | 0.14 |
| 2 | 74784 | -0.03 | -0.50 | 0.11 | -0.14 | -0.27 | -0.20 | 0.22 | 0.30 | 0.07 | -0.24 | -0.03 | -0.14 | -0.49 | -0.24 | -0.30 |
| 3 | 70976 | 0.04 | 0.07 | 0.11 | -0.09 | -0.27 | 0.12 | 0.28 | 0.30 | 0.36 | -0.24 | -0.51 | 0.40 | -0.18 | 0.45 | -0.08 |
| 4 | 46646 | 0.10 | 0.07 | 0.02 | -0.14 | -0.27 | -0.20 | 0.28 | 0.30 | 0.36 | 0.12 | 0.39 | 0.11 | 0.04 | 0.16 | 0.48 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15995 | 67667 | -0.08 | 0.07 | 0.11 | -0.14 | 0.19 | -0.20 | 0.22 | 0.30 | 0.07 | 0.02 | -0.03 | 0.11 | -0.49 | 0.16 | -0.08 |
| 15996 | 51147 | -0.10 | 0.07 | -0.09 | -0.14 | 0.19 | 0.14 | 0.28 | 0.30 | 0.36 | 0.12 | 0.58 | 0.40 | 0.04 | 0.16 | 0.48 |
| 15997 | 42495 | 0.01 | 0.07 | -0.09 | 0.07 | -0.27 | 0.12 | -0.15 | -0.13 | -0.14 | 0.39 | -0.03 | -0.14 | -0.49 | -0.24 | 0.14 |
| 15998 | 52518 | -0.03 | 0.07 | -0.20 | -0.09 | 0.08 | 0.14 | -0.15 | -0.13 | -0.14 | 0.39 | -0.03 | -0.14 | -0.49 | -0.24 | 0.14 |
| 15999 | 7755 | -0.07 | 0.07 | -0.09 | 0.08 | 0.19 | -0.06 | -0.15 | -0.13 | -0.14 | 0.02 | 0.23 | -0.14 | 0.04 | 0.45 | 0.48 |
16000 rows × 16 columns
column_names = train_df_WOE.columns.tolist()
print(column_names)
['id', 'AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
test_df_WOE_withoutid = test_df_WOE.drop("id", axis=1)
test_df_WOE_withoutid
| | AP001_WOE | AP003_bin_WOE | AP008_WOE | CR009_bin_WOE | CR015_bin_WOE | CR019_WOE | PA022_bin_WOE | PA023_bin_WOE | PA029_bin_WOE | TD001_bin_WOE | TD005_WOE | TD006_bin_WOE | TD009_bin_WOE | TD010_bin_WOE | TD014_bin_WOE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.04 | 0.07 | 0.02 | -0.09 | -0.27 | 0.02 | 0.22 | 0.26 | -0.14 | 0.02 | -0.22 | -0.14 | -0.18 | -0.24 | -0.30 |
| 1 | -0.04 | 0.07 | 0.11 | -0.14 | -0.27 | 0.02 | -0.15 | -0.13 | -0.14 | 0.02 | 0.04 | -0.14 | 0.49 | -0.24 | 0.14 |
| 2 | -0.03 | -0.50 | 0.11 | -0.14 | -0.27 | -0.20 | 0.22 | 0.30 | 0.07 | -0.24 | -0.03 | -0.14 | -0.49 | -0.24 | -0.30 |
| 3 | 0.04 | 0.07 | 0.11 | -0.09 | -0.27 | 0.12 | 0.28 | 0.30 | 0.36 | -0.24 | -0.51 | 0.40 | -0.18 | 0.45 | -0.08 |
| 4 | 0.10 | 0.07 | 0.02 | -0.14 | -0.27 | -0.20 | 0.28 | 0.30 | 0.36 | 0.12 | 0.39 | 0.11 | 0.04 | 0.16 | 0.48 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15995 | -0.08 | 0.07 | 0.11 | -0.14 | 0.19 | -0.20 | 0.22 | 0.30 | 0.07 | 0.02 | -0.03 | 0.11 | -0.49 | 0.16 | -0.08 |
| 15996 | -0.10 | 0.07 | -0.09 | -0.14 | 0.19 | 0.14 | 0.28 | 0.30 | 0.36 | 0.12 | 0.58 | 0.40 | 0.04 | 0.16 | 0.48 |
| 15997 | 0.01 | 0.07 | -0.09 | 0.07 | -0.27 | 0.12 | -0.15 | -0.13 | -0.14 | 0.39 | -0.03 | -0.14 | -0.49 | -0.24 | 0.14 |
| 15998 | -0.03 | 0.07 | -0.20 | -0.09 | 0.08 | 0.14 | -0.15 | -0.13 | -0.14 | 0.39 | -0.03 | -0.14 | -0.49 | -0.24 | 0.14 |
| 15999 | -0.07 | 0.07 | -0.09 | 0.08 | 0.19 | -0.06 | -0.15 | -0.13 | -0.14 | 0.02 | 0.23 | -0.14 | 0.04 | 0.45 | 0.48 |
16000 rows × 15 columns
train_df_WOE_withoutid = train_df_WOE.drop("id", axis=1)
train_df_WOE_withoutid
| | AP001_WOE | AP003_bin_WOE | AP008_WOE | CR009_bin_WOE | CR015_bin_WOE | CR019_WOE | PA022_bin_WOE | PA023_bin_WOE | PA029_bin_WOE | TD001_bin_WOE | TD005_WOE | TD006_bin_WOE | TD009_bin_WOE | TD010_bin_WOE | TD014_bin_WOE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.03 | -0.50 | -0.09 | 0.07 | 0.08 | 0.02 | -0.15 | -0.13 | -0.14 | 0.39 | 0.41 | 0.40 | 0.49 | 0.45 | 0.48 |
| 1 | -0.04 | 0.07 | -0.09 | -0.09 | -0.27 | -0.22 | 0.28 | 0.30 | 0.07 | 0.02 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| 2 | 0.01 | 0.07 | 0.11 | 0.07 | 0.08 | -0.22 | -0.15 | -0.13 | -0.14 | 0.02 | -0.03 | -0.14 | 0.17 | 0.00 | -0.08 |
| 3 | -0.03 | 0.07 | 0.09 | 0.07 | -0.27 | 0.02 | -0.15 | -0.13 | -0.14 | 0.39 | 0.59 | 0.40 | 0.49 | 0.16 | 0.14 |
| 4 | -0.09 | -0.50 | 0.02 | -0.14 | -0.27 | -0.01 | -0.15 | -0.13 | -0.14 | 0.02 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 63995 | 0.04 | 0.07 | 0.02 | 0.07 | 0.08 | 0.12 | -0.15 | -0.13 | -0.14 | 0.39 | 0.04 | -0.14 | 0.04 | -0.24 | -0.08 |
| 63996 | 0.01 | 0.07 | 0.09 | -0.14 | -0.27 | 0.02 | 0.28 | -0.13 | 0.07 | 0.02 | 0.04 | -0.14 | 0.04 | -0.24 | -0.30 |
| 63997 | 0.04 | 0.07 | -0.09 | 0.07 | 0.08 | 0.12 | -0.15 | -0.13 | -0.14 | 0.02 | 0.69 | 0.40 | 0.49 | 0.16 | 0.14 |
| 63998 | 0.04 | 0.07 | 0.11 | -0.14 | -0.27 | -0.20 | -0.15 | -0.13 | -0.14 | -0.24 | -0.22 | 0.11 | 0.17 | 0.00 | -0.08 |
| 63999 | 0.10 | 0.07 | 0.09 | 0.08 | 0.08 | 0.14 | -0.15 | -0.13 | -0.14 | -0.24 | -0.51 | -0.14 | -0.18 | -0.24 | -0.30 |
64000 rows × 15 columns
train_df_rf = train_df[['id', 'loan_default']].copy()
train_df_rf = pd.merge(train_df_rf, train_df_WOE_AP001[['id', 'AP001_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_AP003[['id', 'AP003_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_AP008[['id', 'AP008_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_CR009[['id', 'CR009_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_CR015[['id', 'CR015_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_CR019[['id', 'CR019_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_PA022[['id', 'PA022_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_PA023[['id', 'PA023_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_PA029[['id', 'PA029_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_TD001[['id', 'TD001_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_TD005[['id', 'TD005_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_TD006[['id', 'TD006_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_TD009[['id', 'TD009_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_TD010[['id', 'TD010_bin_WOE']], on='id', how='left')
train_df_rf = pd.merge(train_df_rf, train_df_WOE_TD014[['id', 'TD014_bin_WOE']], on='id', how='left')
train_df_rf
| | id | loan_default | AP001_WOE | AP003_bin_WOE | AP008_WOE | CR009_bin_WOE | CR015_bin_WOE | CR019_WOE | PA022_bin_WOE | PA023_bin_WOE | PA029_bin_WOE | TD001_bin_WOE | TD005_WOE | TD006_bin_WOE | TD009_bin_WOE | TD010_bin_WOE | TD014_bin_WOE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3823 | 0 | -0.03 | -0.50 | -0.09 | 0.07 | 0.08 | 0.02 | -0.15 | -0.13 | -0.14 | 0.39 | 0.41 | 0.40 | 0.49 | 0.45 | 0.48 |
| 1 | 35563 | 1 | -0.04 | 0.07 | -0.09 | -0.09 | -0.27 | -0.22 | 0.28 | 0.30 | 0.07 | 0.02 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| 2 | 4884 | 0 | 0.01 | 0.07 | 0.11 | 0.07 | 0.08 | -0.22 | -0.15 | -0.13 | -0.14 | 0.02 | -0.03 | -0.14 | 0.17 | 0.00 | -0.08 |
| 3 | 71171 | 0 | -0.03 | 0.07 | 0.09 | 0.07 | -0.27 | 0.02 | -0.15 | -0.13 | -0.14 | 0.39 | 0.59 | 0.40 | 0.49 | 0.16 | 0.14 |
| 4 | 25666 | 0 | -0.09 | -0.50 | 0.02 | -0.14 | -0.27 | -0.01 | -0.15 | -0.13 | -0.14 | 0.02 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 63995 | 6266 | 0 | 0.04 | 0.07 | 0.02 | 0.07 | 0.08 | 0.12 | -0.15 | -0.13 | -0.14 | 0.39 | 0.04 | -0.14 | 0.04 | -0.24 | -0.08 |
| 63996 | 54887 | 0 | 0.01 | 0.07 | 0.09 | -0.14 | -0.27 | 0.02 | 0.28 | -0.13 | 0.07 | 0.02 | 0.04 | -0.14 | 0.04 | -0.24 | -0.30 |
| 63997 | 76821 | 0 | 0.04 | 0.07 | -0.09 | 0.07 | 0.08 | 0.12 | -0.15 | -0.13 | -0.14 | 0.02 | 0.69 | 0.40 | 0.49 | 0.16 | 0.14 |
| 63998 | 861 | 1 | 0.04 | 0.07 | 0.11 | -0.14 | -0.27 | -0.20 | -0.15 | -0.13 | -0.14 | -0.24 | -0.22 | 0.11 | 0.17 | 0.00 | -0.08 |
| 63999 | 15796 | 0 | 0.10 | 0.07 | 0.09 | 0.08 | 0.08 | 0.14 | -0.15 | -0.13 | -0.14 | -0.24 | -0.51 | -0.14 | -0.18 | -0.24 | -0.30 |
64000 rows × 17 columns
test_df_rf = test_df[['id', 'loan_default']].copy()
test_df_rf = pd.merge(test_df_rf, test_df_WOE_AP001[['id', 'AP001_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_AP003[['id', 'AP003_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_AP008[['id', 'AP008_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_CR009[['id', 'CR009_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_CR015[['id', 'CR015_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_CR019[['id', 'CR019_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_PA022[['id', 'PA022_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_PA023[['id', 'PA023_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_PA029[['id', 'PA029_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD001[['id', 'TD001_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD005[['id', 'TD005_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD006[['id', 'TD006_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD009[['id', 'TD009_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD010[['id', 'TD010_bin_WOE']], on='id', how='left')
test_df_rf = pd.merge(test_df_rf, test_df_WOE_TD014[['id', 'TD014_bin_WOE']], on='id', how='left')
test_df_rf
| | id | loan_default | AP001_WOE | AP003_bin_WOE | AP008_WOE | CR009_bin_WOE | CR015_bin_WOE | CR019_WOE | PA022_bin_WOE | PA023_bin_WOE | PA029_bin_WOE | TD001_bin_WOE | TD005_WOE | TD006_bin_WOE | TD009_bin_WOE | TD010_bin_WOE | TD014_bin_WOE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47045 | 0 | 0.04 | 0.07 | 0.02 | -0.09 | -0.27 | 0.02 | 0.22 | 0.26 | -0.14 | 0.02 | -0.22 | -0.14 | -0.18 | -0.24 | -0.30 |
| 1 | 44296 | 0 | -0.04 | 0.07 | 0.11 | -0.14 | -0.27 | 0.02 | -0.15 | -0.13 | -0.14 | 0.02 | 0.04 | -0.14 | 0.49 | -0.24 | 0.14 |
| 2 | 74784 | 0 | -0.03 | -0.50 | 0.11 | -0.14 | -0.27 | -0.20 | 0.22 | 0.30 | 0.07 | -0.24 | -0.03 | -0.14 | -0.49 | -0.24 | -0.30 |
| 3 | 70976 | 1 | 0.04 | 0.07 | 0.11 | -0.09 | -0.27 | 0.12 | 0.28 | 0.30 | 0.36 | -0.24 | -0.51 | 0.40 | -0.18 | 0.45 | -0.08 |
| 4 | 46646 | 0 | 0.10 | 0.07 | 0.02 | -0.14 | -0.27 | -0.20 | 0.28 | 0.30 | 0.36 | 0.12 | 0.39 | 0.11 | 0.04 | 0.16 | 0.48 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15995 | 67667 | 0 | -0.08 | 0.07 | 0.11 | -0.14 | 0.19 | -0.20 | 0.22 | 0.30 | 0.07 | 0.02 | -0.03 | 0.11 | -0.49 | 0.16 | -0.08 |
| 15996 | 51147 | 0 | -0.10 | 0.07 | -0.09 | -0.14 | 0.19 | 0.14 | 0.28 | 0.30 | 0.36 | 0.12 | 0.58 | 0.40 | 0.04 | 0.16 | 0.48 |
| 15997 | 42495 | 1 | 0.01 | 0.07 | -0.09 | 0.07 | -0.27 | 0.12 | -0.15 | -0.13 | -0.14 | 0.39 | -0.03 | -0.14 | -0.49 | -0.24 | 0.14 |
| 15998 | 52518 | 0 | -0.03 | 0.07 | -0.20 | -0.09 | 0.08 | 0.14 | -0.15 | -0.13 | -0.14 | 0.39 | -0.03 | -0.14 | -0.49 | -0.24 | 0.14 |
| 15999 | 7755 | 0 | -0.07 | 0.07 | -0.09 | 0.08 | 0.19 | -0.06 | -0.15 | -0.13 | -0.14 | 0.02 | 0.23 | -0.14 | 0.04 | 0.45 | 0.48 |
16000 rows × 17 columns
import numpy as np
import datetime
import pandas as pd
import matplotlib.pyplot as plt
#import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
#Use WOE transformed features to run model
#train_df_WOE & test_df_WOE
train_df_rf.shape
(64000, 17)
test_df_rf.shape
(16000, 17)
var = pd.DataFrame(train_df_rf.dtypes)
var
| | 0 |
|---|---|
| id | int64 |
| loan_default | int64 |
| AP001_WOE | float64 |
| AP003_bin_WOE | float64 |
| AP008_WOE | float64 |
| CR009_bin_WOE | float64 |
| CR015_bin_WOE | float64 |
| CR019_WOE | float64 |
| PA022_bin_WOE | float64 |
| PA023_bin_WOE | float64 |
| PA029_bin_WOE | float64 |
| TD001_bin_WOE | float64 |
| TD005_WOE | float64 |
| TD006_bin_WOE | float64 |
| TD009_bin_WOE | float64 |
| TD010_bin_WOE | float64 |
| TD014_bin_WOE | float64 |
%pip install h2o
Requirement already satisfied: h2o in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (3.42.0.1)
Requirement already satisfied: requests in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from h2o) (2.28.2)
Requirement already satisfied: tabulate in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from h2o) (0.9.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (3.1.0)
Requirement already satisfied: idna<4,>=2.5 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (1.26.15)
Requirement already satisfied: certifi>=2017.4.17 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from requests->h2o) (2022.12.7)
[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
import h2o
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
| H2O_cluster_uptime: | 8 hours 9 mins |
| H2O_cluster_timezone: | Asia/Taipei |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.42.0.1 |
| H2O_cluster_version_age: | 1 month and 27 days |
| H2O_cluster_name: | H2O_from_python_yientseng_hm4qux |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 1.469 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| Python_version: | 3.11.1 final |
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
target='loan_default'
train_smpl = train_df_rf.sample(frac=0.1, random_state=1)
test_smpl = test_df_rf.sample(frac=0.1, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
predictors = train_df_rf.columns.tolist()
predictors=predictors[2:17]
predictors
['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
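As an aside, the positional slice `predictors[2:17]` silently breaks if columns are added or reordered; selecting predictors by name is more robust. A small sketch on a hypothetical stand-in frame (two WOE columns only, for brevity):

```python
import pandas as pd

# Hypothetical stand-in for train_df_rf.
df = pd.DataFrame({
    'id': [1, 2],
    'loan_default': [0, 1],
    'AP001_WOE': [0.04, -0.03],
    'TD005_WOE': [0.41, -0.22],
})

# Keep every column except the key and the target, regardless of position.
predictors = [c for c in df.columns if c not in ('id', 'loan_default')]
```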
rf_v1 = H2ORandomForestEstimator(
model_id = 'rf_v1',
ntrees = 300,
nfolds=10,
min_rows=100,
seed=1234)
rf_v1.train(predictors,target,training_frame=train_hex)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: rf_v1
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 300.0 | 300.0 | 125312.0 | 7.0 | 12.0 | 8.963333 | 24.0 | 32.0 | 28.423334 |
ModelMetricsRegression: drf
** Reported on train data. **
MSE: 0.14951065232172528
RMSE: 0.3866660734040747
MAE: 0.2992474803933356
RMSLE: 0.2712098866294993
Mean Residual Deviance: 0.14951065232172528
ModelMetricsRegression: drf
** Reported on cross-validation data. **
MSE: 0.14963480040402838
RMSE: 0.3868265766516416
MAE: 0.299352536320045
RMSLE: 0.27131915462305656
Mean Residual Deviance: 0.14963480040402838
| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | cv_6_valid | cv_7_valid | cv_8_valid | cv_9_valid | cv_10_valid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mae | 0.2993906 | 0.0082924 | 0.3132309 | 0.2944414 | 0.3019102 | 0.2925448 | 0.3000576 | 0.3052748 | 0.2901710 | 0.3090672 | 0.2877329 | 0.2994755 |
| mean_residual_deviance | 0.1496330 | 0.0094428 | 0.1652384 | 0.1433392 | 0.1545313 | 0.1437271 | 0.1504920 | 0.1577827 | 0.1377648 | 0.1591601 | 0.1371658 | 0.1471285 |
| mse | 0.1496330 | 0.0094428 | 0.1652384 | 0.1433392 | 0.1545313 | 0.1437271 | 0.1504920 | 0.1577827 | 0.1377648 | 0.1591601 | 0.1371658 | 0.1471285 |
| r2 | 0.0231276 | 0.0151769 | 0.0032182 | 0.0241438 | 0.0142291 | 0.0122233 | 0.0187475 | 0.0222429 | 0.0333000 | 0.0565313 | 0.0124066 | 0.0342336 |
| residual_deviance | 0.1496330 | 0.0094428 | 0.1652384 | 0.1433392 | 0.1545313 | 0.1437271 | 0.1504920 | 0.1577827 | 0.1377648 | 0.1591601 | 0.1371658 | 0.1471285 |
| rmse | 0.3866515 | 0.0121850 | 0.4064952 | 0.3786016 | 0.3931047 | 0.3791136 | 0.3879329 | 0.3972187 | 0.3711669 | 0.3989487 | 0.3703589 | 0.3835733 |
| rmsle | 0.2712443 | 0.0065110 | 0.2830140 | 0.2670296 | 0.2748869 | 0.2674065 | 0.2722500 | 0.2767344 | 0.2627884 | 0.2759479 | 0.2629195 | 0.2694657 |
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2023-07-26 23:24:29 | 1 min 3.505 sec | 0.0 | nan | nan | nan | |
| 2023-07-26 23:24:29 | 1 min 3.573 sec | 1.0 | 0.3875406 | 0.2971530 | 0.1501877 | |
| 2023-07-26 23:24:30 | 1 min 3.615 sec | 2.0 | 0.3890889 | 0.3008566 | 0.1513902 | |
| 2023-07-26 23:24:30 | 1 min 3.630 sec | 3.0 | 0.3882861 | 0.2995615 | 0.1507661 | |
| 2023-07-26 23:24:30 | 1 min 3.645 sec | 4.0 | 0.3869131 | 0.2988633 | 0.1497017 | |
| 2023-07-26 23:24:30 | 1 min 3.660 sec | 5.0 | 0.3873920 | 0.3000798 | 0.1500726 | |
| 2023-07-26 23:24:30 | 1 min 3.674 sec | 6.0 | 0.3873828 | 0.2999194 | 0.1500654 | |
| 2023-07-26 23:24:30 | 1 min 3.690 sec | 7.0 | 0.3879887 | 0.3004147 | 0.1505352 | |
| 2023-07-26 23:24:30 | 1 min 3.706 sec | 8.0 | 0.3881241 | 0.3000204 | 0.1506403 | |
| 2023-07-26 23:24:30 | 1 min 3.729 sec | 9.0 | 0.3876442 | 0.3000743 | 0.1502680 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 23:24:33 | 1 min 7.048 sec | 291.0 | 0.3866741 | 0.2992291 | 0.1495169 | |
| 2023-07-26 23:24:33 | 1 min 7.060 sec | 292.0 | 0.3866675 | 0.2992232 | 0.1495117 | |
| 2023-07-26 23:24:33 | 1 min 7.069 sec | 293.0 | 0.3866678 | 0.2992219 | 0.1495120 | |
| 2023-07-26 23:24:33 | 1 min 7.079 sec | 294.0 | 0.3866633 | 0.2992180 | 0.1495085 | |
| 2023-07-26 23:24:33 | 1 min 7.092 sec | 295.0 | 0.3866660 | 0.2992240 | 0.1495106 | |
| 2023-07-26 23:24:33 | 1 min 7.103 sec | 296.0 | 0.3866689 | 0.2992322 | 0.1495128 | |
| 2023-07-26 23:24:33 | 1 min 7.111 sec | 297.0 | 0.3866656 | 0.2992397 | 0.1495103 | |
| 2023-07-26 23:24:33 | 1 min 7.122 sec | 298.0 | 0.3866640 | 0.2992462 | 0.1495091 | |
| 2023-07-26 23:24:33 | 1 min 7.130 sec | 299.0 | 0.3866674 | 0.2992554 | 0.1495117 | |
| 2023-07-26 23:24:33 | 1 min 7.139 sec | 300.0 | 0.3866661 | 0.2992475 | 0.1495107 |
[301 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 2103.5937500 | 1.0 | 0.2383542 |
| TD005_WOE | 1458.4207764 | 0.6932996 | 0.1652509 |
| PA029_bin_WOE | 1220.0726318 | 0.5799944 | 0.1382441 |
| TD014_bin_WOE | 594.2836304 | 0.2825087 | 0.0673371 |
| CR019_WOE | 551.7902832 | 0.2623084 | 0.0625223 |
| CR015_bin_WOE | 534.6726685 | 0.2541711 | 0.0605827 |
| PA023_bin_WOE | 413.6450500 | 0.1966373 | 0.0468693 |
| AP003_bin_WOE | 331.2231140 | 0.1574558 | 0.0375303 |
| AP001_WOE | 304.8792114 | 0.1449326 | 0.0345453 |
| AP008_WOE | 283.7080994 | 0.1348683 | 0.0321464 |
| TD010_bin_WOE | 267.6243286 | 0.1272224 | 0.0303240 |
| TD001_bin_WOE | 251.5751953 | 0.1195931 | 0.0285055 |
| CR009_bin_WOE | 246.8613586 | 0.1173522 | 0.0279714 |
| PA022_bin_WOE | 185.5974274 | 0.0882287 | 0.0210297 |
| TD006_bin_WOE | 77.5485992 | 0.0368648 | 0.0087869 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
def VarImp(model_name):
    import matplotlib.pyplot as plt
    # plot the variable importance as a horizontal bar chart
    plt.rcdefaults()
    variables = model_name._model_json['output']['variable_importances']['variable']
    y_pos = np.arange(len(variables))
    fig, ax = plt.subplots(figsize=(6, len(variables) / 2))
    scaled_importance = model_name._model_json['output']['variable_importances']['scaled_importance']
    ax.barh(y_pos, scaled_importance, align='center', color='green')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(variables)
    ax.invert_yaxis()
    ax.set_xlabel('Scaled Importance')
    ax.set_title('Variable Importance')
    plt.show()
VarImp(rf_v1)
predictions = rf_v1.predict(test_hex)
predictions.head()
test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| | loan_default | predict |
|---|---|---|
| 0 | 0 | 0.278723 |
| 1 | 0 | 0.257221 |
| 2 | 0 | 0.209367 |
| 3 | 0 | 0.153741 |
| 4 | 0 | 0.215133 |
def createGains(model):
    predictions = model.predict(test_hex)
    test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
    # sort on prediction (descending) and assign each row to a decile (10 equal groups)
    test_scores = test_scores.sort_values(by='predict', ascending=False)
    test_scores['row_id'] = range(len(test_scores))
    test_scores['decile'] = (test_scores['row_id'] / (len(test_scores) / 10)).astype(int)
    # guard against a stray 11th decile from truncation; only touch the decile column
    test_scores.loc[test_scores['decile'] == 10, 'decile'] = 9
    # create the gains table
    gains = test_scores.groupby('decile')['loan_default'].agg(['count', 'sum'])
    gains.columns = ['count', 'actual']
    # add cumulative features to the gains table
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) / 10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs(gains['percent_cum_actual'] - gains['percent_cum_non_actual']) * 100
    gains['gain'] = (gains['cum_actual'] / gains['cum_count'] * 100).round(2)
    return gains
createGains(rf_v1)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| decile | count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 160 | 47 | 113 | 160 | 47 | 113 | 0.16 | 0.09 | 30.0 | 1.57 | 7.0 | 29.38 |
| 1 | 160 | 42 | 118 | 320 | 89 | 231 | 0.30 | 0.18 | 60.0 | 1.48 | 12.0 | 27.81 |
| 2 | 160 | 41 | 119 | 480 | 130 | 350 | 0.43 | 0.27 | 90.0 | 1.44 | 16.0 | 27.08 |
| 3 | 160 | 35 | 125 | 640 | 165 | 475 | 0.55 | 0.37 | 120.0 | 1.38 | 18.0 | 25.78 |
| 4 | 160 | 31 | 129 | 800 | 196 | 604 | 0.65 | 0.46 | 150.0 | 1.31 | 19.0 | 24.50 |
| 5 | 160 | 24 | 136 | 960 | 220 | 740 | 0.73 | 0.57 | 180.0 | 1.22 | 16.0 | 22.92 |
| 6 | 160 | 23 | 137 | 1120 | 243 | 877 | 0.81 | 0.67 | 210.0 | 1.16 | 14.0 | 21.70 |
| 7 | 160 | 16 | 144 | 1280 | 259 | 1021 | 0.86 | 0.79 | 240.0 | 1.08 | 7.0 | 20.23 |
| 8 | 160 | 23 | 137 | 1440 | 282 | 1158 | 0.94 | 0.89 | 270.0 | 1.04 | 5.0 | 19.58 |
| 9 | 160 | 18 | 142 | 1600 | 300 | 1300 | 1.00 | 1.00 | 300.0 | 1.00 | 0.0 | 18.75 |
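The mechanics of the gains table can be checked on a toy example: rank accounts by score, take the cumulative share of defaults captured, and compare it to random targeting. All values below are made up purely for illustration:

```python
import numpy as np

# Toy scores and outcomes: 10 accounts, 4 defaults.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
y      = np.array([1,   1,   0,   1,   0,   0,   0,   1,   0,   0])

order = np.argsort(-scores)                 # highest-risk accounts first
cum_capture = np.cumsum(y[order]) / y.sum() # cumulative % of defaults captured

# Lift at the top 20%: defaults captured / defaults expected under random targeting.
lift_top20 = cum_capture[1] / 0.2
```

Here the top 20% of scores capture 2 of the 4 defaults, giving a cumulative lift of 2.5, which is the same calculation the `lift` column performs per decile.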
def ROC_AUC(my_result,df,target):
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# ROC
y_actual = df[target].as_data_frame()
y_pred = my_result.predict(df).as_data_frame()
fpr,tpr,_ = roc_curve(y_actual,y_pred)
roc_auc = auc(fpr,tpr)
# Precision-Recall
average_precision = average_precision_score(y_actual,y_pred)
print('')
    print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
print('')
print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
print('')
print(' * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
print('')
# plotting
plt.figure(figsize=(10,4))
# ROC
plt.subplot(1,2,1)
    plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (area=%0.2f)' % roc_auc)
plt.plot([0,1],[0,1],color='navy',lw=3,linestyle='--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic: AUC={0:0.4f}'.format(roc_auc))
plt.legend(loc='lower right')
# Precision-Recall
plt.subplot(1,2,2)
precision,recall,_ = precision_recall_curve(y_actual,y_pred)
plt.step(recall,precision,color='b',alpha=0.2,where='post')
plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0,1.05])
plt.xlim([0.0,1.0])
plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
plt.show()
ROC_AUC(rf_v1,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
 * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate
 * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy
 * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
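The AUC computed by `roc_curve`/`auc` above can be cross-checked with the rank (Mann-Whitney) formulation: AUC is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A numpy-only sketch, fine for small arrays:

```python
import numpy as np

def auc_rank(y_true, y_score):
    """AUC as P(score of a random positive > score of a random negative)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Count positive-negative pairs ranked correctly; ties count half.
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

For example, `auc_rank([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75: three of the four positive-negative pairs are ordered correctly.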
It turns out the model does not perform better when trained on the entire training dataset. One possible reason is that the smaller sample forced the model to generalize rather than memorize, making it less prone to overfitting.
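Before retraining on the full data, one of the required sampling experiments can be sketched in plain pandas: random under-sampling keeps every default and draws an equally sized random subset of non-defaults. The frame below is a made-up stand-in for `train_df_rf`:

```python
import pandas as pd

# Hypothetical imbalanced frame: 8 non-defaults, 2 defaults.
df = pd.DataFrame({'loan_default': [0] * 8 + [1] * 2, 'x': range(10)})

minority = df[df['loan_default'] == 1]
majority = df[df['loan_default'] == 0].sample(n=len(minority), random_state=1)
balanced = pd.concat([majority, minority])  # 50/50 class mix
```

Over-sampling is the mirror image (sample the minority class with replacement up to the majority count); H2O's tree estimators also expose a `balance_classes` option for the same purpose.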
train_hex = h2o.H2OFrame(train_df_rf)
test_hex = h2o.H2OFrame(test_df_rf)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
rf_v2 = H2ORandomForestEstimator(
model_id = 'rf_v2',
ntrees = 300,
nfolds=10,
min_rows=100,
seed=1234)
rf_v2.train(predictors,target,training_frame=train_hex)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details ============= H2ORandomForestEstimator : Distributed Random Forest Model Key: rf_v2
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 300.0 | 300.0 | 1073835.0 | 13.0 | 20.0 | 17.026667 | 263.0 | 297.0 | 280.53665 |
ModelMetricsRegression: drf ** Reported on train data. ** MSE: 0.1498718771489339 RMSE: 0.38713289339570967 MAE: 0.30003284286595944 RMSLE: 0.2715420816428299 Mean Residual Deviance: 0.1498718771489339
ModelMetricsRegression: drf ** Reported on cross-validation data. ** MSE: 0.1499180749524892 RMSE: 0.3871925553939399 MAE: 0.30018390703180164 RMSLE: 0.271583976669591 Mean Residual Deviance: 0.1499180749524892
| mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | cv_6_valid | cv_7_valid | cv_8_valid | cv_9_valid | cv_10_valid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mae | 0.3001959 | 0.0035226 | 0.3043954 | 0.3046928 | 0.3026648 | 0.297634 | 0.2955310 | 0.2948277 | 0.3015419 | 0.2978897 | 0.3019386 | 0.3008434 |
| mean_residual_deviance | 0.1499312 | 0.0035814 | 0.1541525 | 0.1549230 | 0.1515166 | 0.1477617 | 0.1460344 | 0.1438500 | 0.1517742 | 0.1474197 | 0.1518276 | 0.1500522 |
| mse | 0.1499312 | 0.0035814 | 0.1541525 | 0.1549230 | 0.1515166 | 0.1477617 | 0.1460344 | 0.1438500 | 0.1517742 | 0.1474197 | 0.1518276 | 0.1500522 |
| r2 | 0.0363812 | 0.0029430 | 0.0362065 | 0.0338884 | 0.0395042 | 0.0345291 | 0.0313030 | 0.0379271 | 0.0411318 | 0.0376920 | 0.0341692 | 0.0374612 |
| residual_deviance | 0.1499312 | 0.0035814 | 0.1541525 | 0.1549230 | 0.1515166 | 0.1477617 | 0.1460344 | 0.1438500 | 0.1517742 | 0.1474197 | 0.1518276 | 0.1500522 |
| rmse | 0.3871846 | 0.0046316 | 0.3926226 | 0.3936026 | 0.3892513 | 0.3843979 | 0.3821444 | 0.3792757 | 0.3895821 | 0.3839527 | 0.3896506 | 0.3873657 |
| rmsle | 0.2715834 | 0.0024911 | 0.2745020 | 0.2751261 | 0.2727129 | 0.2700233 | 0.2689685 | 0.2673593 | 0.2726018 | 0.2697902 | 0.2730406 | 0.2717092 |
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2023-07-26 23:30:18 | 5 min 39.854 sec | 0.0 | nan | nan | nan | |
| 2023-07-26 23:30:18 | 5 min 39.953 sec | 1.0 | 0.3899816 | 0.2991295 | 0.1520856 | |
| 2023-07-26 23:30:18 | 5 min 40.058 sec | 2.0 | 0.3913598 | 0.3006538 | 0.1531625 | |
| 2023-07-26 23:30:18 | 5 min 40.148 sec | 3.0 | 0.3896600 | 0.3000131 | 0.1518349 | |
| 2023-07-26 23:30:18 | 5 min 40.247 sec | 4.0 | 0.3890223 | 0.2999989 | 0.1513384 | |
| 2023-07-26 23:30:18 | 5 min 40.352 sec | 5.0 | 0.3891703 | 0.3000873 | 0.1514535 | |
| 2023-07-26 23:30:18 | 5 min 40.453 sec | 6.0 | 0.3886908 | 0.3002141 | 0.1510805 | |
| 2023-07-26 23:30:18 | 5 min 40.550 sec | 7.0 | 0.3884296 | 0.3002264 | 0.1508775 | |
| 2023-07-26 23:30:18 | 5 min 40.649 sec | 8.0 | 0.3882854 | 0.3001559 | 0.1507655 | |
| 2023-07-26 23:30:19 | 5 min 40.748 sec | 9.0 | 0.3880043 | 0.3000113 | 0.1505474 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 23:30:21 | 5 min 43.472 sec | 35.0 | 0.3872759 | 0.3002579 | 0.1499827 | |
| 2023-07-26 23:30:21 | 5 min 43.643 sec | 36.0 | 0.3872822 | 0.3002692 | 0.1499875 | |
| 2023-07-26 23:30:22 | 5 min 43.811 sec | 37.0 | 0.3872803 | 0.3002645 | 0.1499861 | |
| 2023-07-26 23:30:26 | 5 min 47.869 sec | 76.0 | 0.3871462 | 0.3001187 | 0.1498822 | |
| 2023-07-26 23:30:30 | 5 min 51.936 sec | 120.0 | 0.3871379 | 0.3000850 | 0.1498757 | |
| 2023-07-26 23:30:34 | 5 min 56.004 sec | 160.0 | 0.3871235 | 0.3000569 | 0.1498646 | |
| 2023-07-26 23:30:38 | 6 min 0.010 sec | 200.0 | 0.3871409 | 0.3000359 | 0.1498781 | |
| 2023-07-26 23:30:42 | 6 min 4.102 sec | 244.0 | 0.3871412 | 0.3000565 | 0.1498783 | |
| 2023-07-26 23:30:46 | 6 min 8.123 sec | 282.0 | 0.3871323 | 0.3000268 | 0.1498714 | |
| 2023-07-26 23:30:48 | 6 min 9.983 sec | 300.0 | 0.3871329 | 0.3000328 | 0.1498719 |
[45 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 21400.7890625 | 1.0 | 0.1963187 |
| TD005_WOE | 18484.5566406 | 0.8637325 | 0.1695668 |
| TD014_bin_WOE | 10382.4628906 | 0.4851439 | 0.0952428 |
| AP003_bin_WOE | 9848.0214844 | 0.4601710 | 0.0903402 |
| CR015_bin_WOE | 8241.4550781 | 0.3851005 | 0.0756024 |
| AP008_WOE | 5805.2836914 | 0.2712649 | 0.0532544 |
| CR019_WOE | 5420.5742188 | 0.2532885 | 0.0497253 |
| PA029_bin_WOE | 5250.4301758 | 0.2453382 | 0.0481645 |
| TD010_bin_WOE | 4775.7114258 | 0.2231559 | 0.0438097 |
| PA022_bin_WOE | 4665.4946289 | 0.2180057 | 0.0427986 |
| AP001_WOE | 4154.1879883 | 0.1941138 | 0.0381082 |
| TD001_bin_WOE | 3647.7722168 | 0.1704504 | 0.0334626 |
| PA023_bin_WOE | 3111.0019531 | 0.1453686 | 0.0285386 |
| CR009_bin_WOE | 2362.7492676 | 0.1104048 | 0.0216745 |
| TD006_bin_WOE | 1459.9486084 | 0.0682194 | 0.0133927 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
ROC_AUC(rf_v2,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
createGains(rf_v2)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 1600 | 509 | 1091 | 1600 | 509 | 1091 | 0.16 | 0.08 | 315.0 | 1.62 | 8.0 | 31.81 |
| 1 | 1600 | 440 | 1160 | 3200 | 949 | 2251 | 0.30 | 0.18 | 630.0 | 1.51 | 12.0 | 29.66 |
| 2 | 1600 | 362 | 1238 | 4800 | 1311 | 3489 | 0.42 | 0.27 | 945.0 | 1.39 | 15.0 | 27.31 |
| 3 | 1600 | 368 | 1232 | 6400 | 1679 | 4721 | 0.53 | 0.37 | 1260.0 | 1.33 | 16.0 | 26.23 |
| 4 | 1600 | 326 | 1274 | 8000 | 2005 | 5995 | 0.64 | 0.47 | 1575.0 | 1.27 | 17.0 | 25.06 |
| 5 | 1600 | 237 | 1363 | 9600 | 2242 | 7358 | 0.71 | 0.57 | 1890.0 | 1.19 | 14.0 | 23.35 |
| 6 | 1600 | 239 | 1361 | 11200 | 2481 | 8719 | 0.79 | 0.68 | 2205.0 | 1.13 | 11.0 | 22.15 |
| 7 | 1600 | 263 | 1337 | 12800 | 2744 | 10056 | 0.87 | 0.78 | 2520.0 | 1.09 | 9.0 | 21.44 |
| 8 | 1600 | 233 | 1367 | 14400 | 2977 | 11423 | 0.95 | 0.89 | 2835.0 | 1.05 | 6.0 | 20.67 |
| 9 | 1600 | 173 | 1427 | 16000 | 3150 | 12850 | 1.00 | 1.00 | 3150.0 | 1.00 | 0.0 | 19.69 |
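The lift column in this table can be reproduced from the decile counts alone; a minimal sketch (counts copied from the rf_v2 gains table above, helper names are my own):

```python
import pandas as pd

# Defaults captured per equal-sized score decile, highest scores first
gains = pd.DataFrame({
    "count":  [1600] * 10,
    "actual": [509, 440, 362, 368, 326, 237, 239, 263, 233, 173],
})
gains["cum_count"]  = gains["count"].cumsum()
gains["cum_actual"] = gains["actual"].cumsum()

# Lift = cumulative default rate in the top deciles / overall default rate
base_rate = gains["actual"].sum() / gains["count"].sum()   # 3150 / 16000
gains["lift"] = (gains["cum_actual"] / gains["cum_count"]) / base_rate

print(gains["lift"].round(2).tolist())
# [1.62, 1.51, 1.39, 1.33, 1.27, 1.19, 1.13, 1.09, 1.05, 1.0]
```

The first decile's lift of 1.62 matches the table: targeting the riskiest 10% of applications catches defaults at 1.62 times the base rate.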
train_smpl = train_df_rf.sample(frac=0.1, random_state=1)
test_smpl = test_df_rf.sample(frac=0.1, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
rf_v3 = H2ORandomForestEstimator(
model_id = 'rf_v3',
ntrees = 300,
nfolds=10,
min_rows=100,
balance_classes = True,
seed=1234)
rf_v3.train(predictors,target,training_frame=train_hex)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details ============= H2ORandomForestEstimator : Distributed Random Forest Model Key: rf_v3
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 300.0 | 300.0 | 125323.0 | 7.0 | 12.0 | 8.963333 | 24.0 | 32.0 | 28.423334 |
ModelMetricsRegression: drf ** Reported on train data. ** MSE: 0.14951065232172528 RMSE: 0.3866660734040747 MAE: 0.2992474803933356 RMSLE: 0.2712098866294993 Mean Residual Deviance: 0.14951065232172528
ModelMetricsRegression: drf ** Reported on cross-validation data. ** MSE: 0.14963480040402838 RMSE: 0.3868265766516416 MAE: 0.299352536320045 RMSLE: 0.27131915462305656 Mean Residual Deviance: 0.14963480040402838
| mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | cv_6_valid | cv_7_valid | cv_8_valid | cv_9_valid | cv_10_valid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mae | 0.2993906 | 0.0082924 | 0.3132309 | 0.2944414 | 0.3019102 | 0.2925448 | 0.3000576 | 0.3052748 | 0.2901710 | 0.3090672 | 0.2877329 | 0.2994755 |
| mean_residual_deviance | 0.1496330 | 0.0094428 | 0.1652384 | 0.1433392 | 0.1545313 | 0.1437271 | 0.1504920 | 0.1577827 | 0.1377648 | 0.1591601 | 0.1371658 | 0.1471285 |
| mse | 0.1496330 | 0.0094428 | 0.1652384 | 0.1433392 | 0.1545313 | 0.1437271 | 0.1504920 | 0.1577827 | 0.1377648 | 0.1591601 | 0.1371658 | 0.1471285 |
| r2 | 0.0231276 | 0.0151769 | 0.0032182 | 0.0241438 | 0.0142291 | 0.0122233 | 0.0187475 | 0.0222429 | 0.0333000 | 0.0565313 | 0.0124066 | 0.0342336 |
| residual_deviance | 0.1496330 | 0.0094428 | 0.1652384 | 0.1433392 | 0.1545313 | 0.1437271 | 0.1504920 | 0.1577827 | 0.1377648 | 0.1591601 | 0.1371658 | 0.1471285 |
| rmse | 0.3866515 | 0.0121850 | 0.4064952 | 0.3786016 | 0.3931047 | 0.3791136 | 0.3879329 | 0.3972187 | 0.3711669 | 0.3989487 | 0.3703589 | 0.3835733 |
| rmsle | 0.2712443 | 0.0065110 | 0.2830140 | 0.2670296 | 0.2748869 | 0.2674065 | 0.2722500 | 0.2767344 | 0.2627884 | 0.2759479 | 0.2629195 | 0.2694657 |
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2023-07-26 23:31:26 | 33.817 sec | 0.0 | nan | nan | nan | |
| 2023-07-26 23:31:26 | 33.827 sec | 1.0 | 0.3875406 | 0.2971530 | 0.1501877 | |
| 2023-07-26 23:31:26 | 33.836 sec | 2.0 | 0.3890889 | 0.3008566 | 0.1513902 | |
| 2023-07-26 23:31:26 | 33.844 sec | 3.0 | 0.3882861 | 0.2995615 | 0.1507661 | |
| 2023-07-26 23:31:26 | 33.853 sec | 4.0 | 0.3869131 | 0.2988633 | 0.1497017 | |
| 2023-07-26 23:31:26 | 33.863 sec | 5.0 | 0.3873920 | 0.3000798 | 0.1500726 | |
| 2023-07-26 23:31:26 | 33.872 sec | 6.0 | 0.3873828 | 0.2999194 | 0.1500654 | |
| 2023-07-26 23:31:26 | 33.880 sec | 7.0 | 0.3879887 | 0.3004147 | 0.1505352 | |
| 2023-07-26 23:31:26 | 33.889 sec | 8.0 | 0.3881241 | 0.3000204 | 0.1506403 | |
| 2023-07-26 23:31:26 | 33.896 sec | 9.0 | 0.3876442 | 0.3000743 | 0.1502680 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 23:31:29 | 36.777 sec | 291.0 | 0.3866741 | 0.2992291 | 0.1495169 | |
| 2023-07-26 23:31:29 | 36.786 sec | 292.0 | 0.3866675 | 0.2992232 | 0.1495117 | |
| 2023-07-26 23:31:29 | 36.795 sec | 293.0 | 0.3866678 | 0.2992219 | 0.1495120 | |
| 2023-07-26 23:31:29 | 36.804 sec | 294.0 | 0.3866633 | 0.2992180 | 0.1495085 | |
| 2023-07-26 23:31:29 | 36.819 sec | 295.0 | 0.3866660 | 0.2992240 | 0.1495106 | |
| 2023-07-26 23:31:29 | 36.828 sec | 296.0 | 0.3866689 | 0.2992322 | 0.1495128 | |
| 2023-07-26 23:31:29 | 36.837 sec | 297.0 | 0.3866656 | 0.2992397 | 0.1495103 | |
| 2023-07-26 23:31:29 | 36.848 sec | 298.0 | 0.3866640 | 0.2992462 | 0.1495091 | |
| 2023-07-26 23:31:29 | 36.856 sec | 299.0 | 0.3866674 | 0.2992554 | 0.1495117 | |
| 2023-07-26 23:31:29 | 36.865 sec | 300.0 | 0.3866661 | 0.2992475 | 0.1495107 |
[301 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 2103.5937500 | 1.0 | 0.2383542 |
| TD005_WOE | 1458.4207764 | 0.6932996 | 0.1652509 |
| PA029_bin_WOE | 1220.0726318 | 0.5799944 | 0.1382441 |
| TD014_bin_WOE | 594.2836304 | 0.2825087 | 0.0673371 |
| CR019_WOE | 551.7902832 | 0.2623084 | 0.0625223 |
| CR015_bin_WOE | 534.6726685 | 0.2541711 | 0.0605827 |
| PA023_bin_WOE | 413.6450500 | 0.1966373 | 0.0468693 |
| AP003_bin_WOE | 331.2231140 | 0.1574558 | 0.0375303 |
| AP001_WOE | 304.8792114 | 0.1449326 | 0.0345453 |
| AP008_WOE | 283.7080994 | 0.1348683 | 0.0321464 |
| TD010_bin_WOE | 267.6243286 | 0.1272224 | 0.0303240 |
| TD001_bin_WOE | 251.5751953 | 0.1195931 | 0.0285055 |
| CR009_bin_WOE | 246.8613586 | 0.1173522 | 0.0279714 |
| PA022_bin_WOE | 185.5974274 | 0.0882287 | 0.0210297 |
| TD006_bin_WOE | 77.5485992 | 0.0368648 | 0.0087869 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
ROC_AUC(rf_v3,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
createGains(rf_v3)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 160 | 47 | 113 | 160 | 47 | 113 | 0.16 | 0.09 | 30.0 | 1.57 | 7.0 | 29.38 |
| 1 | 160 | 42 | 118 | 320 | 89 | 231 | 0.30 | 0.18 | 60.0 | 1.48 | 12.0 | 27.81 |
| 2 | 160 | 41 | 119 | 480 | 130 | 350 | 0.43 | 0.27 | 90.0 | 1.44 | 16.0 | 27.08 |
| 3 | 160 | 35 | 125 | 640 | 165 | 475 | 0.55 | 0.37 | 120.0 | 1.38 | 18.0 | 25.78 |
| 4 | 160 | 31 | 129 | 800 | 196 | 604 | 0.65 | 0.46 | 150.0 | 1.31 | 19.0 | 24.50 |
| 5 | 160 | 24 | 136 | 960 | 220 | 740 | 0.73 | 0.57 | 180.0 | 1.22 | 16.0 | 22.92 |
| 6 | 160 | 23 | 137 | 1120 | 243 | 877 | 0.81 | 0.67 | 210.0 | 1.16 | 14.0 | 21.70 |
| 7 | 160 | 16 | 144 | 1280 | 259 | 1021 | 0.86 | 0.79 | 240.0 | 1.08 | 7.0 | 20.23 |
| 8 | 160 | 23 | 137 | 1440 | 282 | 1158 | 0.94 | 0.89 | 270.0 | 1.04 | 5.0 | 19.58 |
| 9 | 160 | 18 | 142 | 1600 | 300 | 1300 | 1.00 | 1.00 | 300.0 | 1.00 | 0.0 | 18.75 |
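The K_S column is just the gap between the two cumulative-percentage columns, scaled to percentage points; a quick sketch with the rf_v3 numbers above:

```python
# Cumulative % of defaults and non-defaults captured per decile (from the table)
pct_cum_actual     = [0.16, 0.30, 0.43, 0.55, 0.65, 0.73, 0.81, 0.86, 0.94, 1.00]
pct_cum_non_actual = [0.09, 0.18, 0.27, 0.37, 0.46, 0.57, 0.67, 0.79, 0.89, 1.00]

# Kolmogorov-Smirnov statistic: the maximum separation between the two curves
gaps = [round((a - n) * 100) for a, n in zip(pct_cum_actual, pct_cum_non_actual)]
print(gaps)       # [7, 12, 16, 18, 19, 16, 14, 7, 5, 0]
print(max(gaps))  # 19, reached at decile 4
```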
#Concatenate along rows (vertically)
#data_undersample = pd.concat([train_df_rf, test_df_rf])
#data_undersample = data_undersample.sort_values(by='id', ascending=True)
#data_undersample
y = train_df_rf[target]
X = train_df_rf.drop(target,axis=1)
y.dtypes
dtype('int64')
y1_cnt = train_df_rf[target].sum()
y1_cnt
12338
N = 2
y0_cnt = y1_cnt * N
y0_cnt
24676
pip install imblearn
Collecting imblearn Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB) Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (from imblearn) (0.10.1) Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.22.4) Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.10.1) Requirement already satisfied: scikit-learn>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.2.2) Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (1.3.1) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from imbalanced-learn->imblearn) (3.2.0) Installing collected packages: imblearn Successfully installed imblearn-0.0
from imblearn.datasets import make_imbalance
# make_imbalance can build the under-sampled set in one call:
#   X_rs, y_rs = make_imbalance(X, y,
#                               sampling_strategy={1: y1_cnt, 0: y0_cnt},
#                               random_state=0)
# Here the same under-sampling is done directly with pandas:
# keep every default and draw twice as many non-defaults.
y_rs = train_df_rf[train_df_rf[target]==1]
X_rs = train_df_rf[train_df_rf[target]==0].sample(n=y0_cnt, random_state=0)
smpl = pd.concat([X_rs,y_rs])
smpl_hex = h2o.H2OFrame(smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
rf_v4 = H2ORandomForestEstimator(
model_id = 'rf_v4',
ntrees = 300,
nfolds=10,
min_rows=100,
seed=1234)
rf_v4.train(predictors,target,training_frame=smpl_hex)
# train with the under-sampled smpl_hex as the training frame
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details ============= H2ORandomForestEstimator : Distributed Random Forest Model Key: rf_v4
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 300.0 | 300.0 | 631596.0 | 12.0 | 18.0 | 14.2 | 149.0 | 174.0 | 163.03 |
ModelMetricsRegression: drf ** Reported on train data. ** MSE: 0.21077186366658382 RMSE: 0.4590989693590956 MAE: 0.4228487633192769 RMSLE: 0.3227436376803513 Mean Residual Deviance: 0.21077186366658382
ModelMetricsRegression: drf ** Reported on cross-validation data. ** MSE: 0.21079706546892607 RMSE: 0.4591264155643041 MAE: 0.4230214030525104 RMSLE: 0.32274926426018896 Mean Residual Deviance: 0.21079706546892607
| mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | cv_6_valid | cv_7_valid | cv_8_valid | cv_9_valid | cv_10_valid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mae | 0.4230262 | 0.0016552 | 0.4242583 | 0.4226995 | 0.4232406 | 0.4205764 | 0.4217563 | 0.4217314 | 0.4248930 | 0.4236615 | 0.4258246 | 0.4216206 |
| mean_residual_deviance | 0.2108000 | 0.0015687 | 0.2116346 | 0.2108016 | 0.2111888 | 0.2086496 | 0.2087994 | 0.2098977 | 0.2120005 | 0.2121581 | 0.2134057 | 0.2094645 |
| mse | 0.2108000 | 0.0015687 | 0.2116346 | 0.2108016 | 0.2111888 | 0.2086496 | 0.2087994 | 0.2098977 | 0.2120005 | 0.2121581 | 0.2134057 | 0.2094645 |
| r2 | 0.0512652 | 0.0054807 | 0.0512642 | 0.0533489 | 0.0554153 | 0.0419659 | 0.0552363 | 0.0527978 | 0.0554222 | 0.0512500 | 0.0407212 | 0.0552302 |
| residual_deviance | 0.2108000 | 0.0015687 | 0.2116346 | 0.2108016 | 0.2111888 | 0.2086496 | 0.2087994 | 0.2098977 | 0.2120005 | 0.2121581 | 0.2134057 | 0.2094645 |
| rmse | 0.4591268 | 0.0017082 | 0.4600376 | 0.4591313 | 0.4595528 | 0.4567817 | 0.4569457 | 0.4581459 | 0.4604351 | 0.4606062 | 0.4619585 | 0.4576729 |
| rmsle | 0.3227511 | 0.0008634 | 0.3232280 | 0.32237 | 0.3225686 | 0.3228075 | 0.3220436 | 0.3219627 | 0.3228393 | 0.3230383 | 0.3248140 | 0.3218389 |
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2023-07-26 23:34:15 | 2 min 38.652 sec | 0.0 | nan | nan | nan | |
| 2023-07-26 23:34:15 | 2 min 38.703 sec | 1.0 | 0.4645581 | 0.4238703 | 0.2158143 | |
| 2023-07-26 23:34:15 | 2 min 38.747 sec | 2.0 | 0.4622949 | 0.4223424 | 0.2137165 | |
| 2023-07-26 23:34:15 | 2 min 38.797 sec | 3.0 | 0.4623465 | 0.4230279 | 0.2137643 | |
| 2023-07-26 23:34:15 | 2 min 38.844 sec | 4.0 | 0.4627498 | 0.4236462 | 0.2141374 | |
| 2023-07-26 23:34:15 | 2 min 38.891 sec | 5.0 | 0.4622605 | 0.4230023 | 0.2136848 | |
| 2023-07-26 23:34:15 | 2 min 38.938 sec | 6.0 | 0.4615715 | 0.4228065 | 0.2130483 | |
| 2023-07-26 23:34:15 | 2 min 38.981 sec | 7.0 | 0.4611671 | 0.4228217 | 0.2126751 | |
| 2023-07-26 23:34:15 | 2 min 39.023 sec | 8.0 | 0.4608365 | 0.4227504 | 0.2123702 | |
| 2023-07-26 23:34:15 | 2 min 39.068 sec | 9.0 | 0.4606891 | 0.4227239 | 0.2122345 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 23:34:19 | 2 min 42.335 sec | 81.0 | 0.4592008 | 0.4228961 | 0.2108654 | |
| 2023-07-26 23:34:19 | 2 min 42.382 sec | 82.0 | 0.4591884 | 0.4228797 | 0.2108540 | |
| 2023-07-26 23:34:19 | 2 min 42.425 sec | 83.0 | 0.4591783 | 0.4228581 | 0.2108448 | |
| 2023-07-26 23:34:19 | 2 min 42.467 sec | 84.0 | 0.4591619 | 0.4228475 | 0.2108296 | |
| 2023-07-26 23:34:19 | 2 min 42.520 sec | 85.0 | 0.4591572 | 0.4228333 | 0.2108253 | |
| 2023-07-26 23:34:19 | 2 min 42.566 sec | 86.0 | 0.4591530 | 0.4228493 | 0.2108215 | |
| 2023-07-26 23:34:19 | 2 min 42.611 sec | 87.0 | 0.4591583 | 0.4228466 | 0.2108264 | |
| 2023-07-26 23:34:23 | 2 min 46.640 sec | 162.0 | 0.4591054 | 0.4228300 | 0.2107778 | |
| 2023-07-26 23:34:27 | 2 min 50.647 sec | 255.0 | 0.4590985 | 0.4228171 | 0.2107714 | |
| 2023-07-26 23:34:29 | 2 min 52.567 sec | 300.0 | 0.4590990 | 0.4228488 | 0.2107719 |
[91 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 24682.8769531 | 1.0 | 0.2158746 |
| TD005_WOE | 19699.0332031 | 0.7980850 | 0.1722863 |
| TD014_bin_WOE | 11587.6376953 | 0.4694606 | 0.1013446 |
| AP003_bin_WOE | 10638.4736328 | 0.4310062 | 0.0930433 |
| CR015_bin_WOE | 9107.7890625 | 0.3689922 | 0.0796560 |
| PA029_bin_WOE | 5926.0659180 | 0.2400881 | 0.0518289 |
| TD010_bin_WOE | 5356.1132812 | 0.2169971 | 0.0468442 |
| AP008_WOE | 5094.9741211 | 0.2064174 | 0.0445603 |
| PA022_bin_WOE | 4758.9101562 | 0.1928021 | 0.0416211 |
| CR019_WOE | 4499.7324219 | 0.1823018 | 0.0393543 |
| TD001_bin_WOE | 3518.0322266 | 0.1425293 | 0.0307684 |
| AP001_WOE | 3328.6904297 | 0.1348583 | 0.0291125 |
| PA023_bin_WOE | 3035.7680664 | 0.1229909 | 0.0265506 |
| CR009_bin_WOE | 1846.0827637 | 0.0747920 | 0.0161457 |
| TD006_bin_WOE | 1258.7723389 | 0.0509978 | 0.0110091 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
ROC_AUC(rf_v4,smpl_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
ROC_AUC(rf_v4,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
createGains(rf_v4)
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 160 | 47 | 113 | 160 | 47 | 113 | 0.16 | 0.09 | 30.0 | 1.57 | 7.0 | 29.38 |
| 1 | 160 | 48 | 112 | 320 | 95 | 225 | 0.32 | 0.17 | 60.0 | 1.58 | 15.0 | 29.69 |
| 2 | 160 | 40 | 120 | 480 | 135 | 345 | 0.45 | 0.27 | 90.0 | 1.50 | 18.0 | 28.12 |
| 3 | 160 | 37 | 123 | 640 | 172 | 468 | 0.57 | 0.36 | 120.0 | 1.43 | 21.0 | 26.88 |
| 4 | 160 | 25 | 135 | 800 | 197 | 603 | 0.66 | 0.46 | 150.0 | 1.31 | 20.0 | 24.62 |
| 5 | 160 | 26 | 134 | 960 | 223 | 737 | 0.74 | 0.57 | 180.0 | 1.24 | 17.0 | 23.23 |
| 6 | 160 | 16 | 144 | 1120 | 239 | 881 | 0.80 | 0.68 | 210.0 | 1.14 | 12.0 | 21.34 |
| 7 | 160 | 19 | 141 | 1280 | 258 | 1022 | 0.86 | 0.79 | 240.0 | 1.08 | 7.0 | 20.16 |
| 8 | 160 | 26 | 134 | 1440 | 284 | 1156 | 0.95 | 0.89 | 270.0 | 1.05 | 6.0 | 19.72 |
| 9 | 160 | 16 | 144 | 1600 | 300 | 1300 | 1.00 | 1.00 | 300.0 | 1.00 | 0.0 | 18.75 |
from imblearn.over_sampling import RandomOverSampler
# Assuming you have a DataFrame train_df_rf with your training data
target = 'loan_default'
X = train_df_rf.drop(target, axis=1)
y = train_df_rf[target]
# Instantiate the RandomOverSampler
ros = RandomOverSampler(random_state=0)
# Perform the Random Over-Sampling on the data
X_ros, y_ros = ros.fit_resample(X, y)
X_ros = pd.DataFrame(X_ros)
y_ros = pd.DataFrame(y_ros)
# Recombine the over-sampled features and target into one training frame
smpl2 = pd.concat([X_ros, y_ros], axis=1)
smpl_hex2 = h2o.H2OFrame(smpl2)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
rf_v5 = H2ORandomForestEstimator(
model_id = 'rf_v5',
ntrees = 300,
nfolds=10,
min_rows=100,
seed=1234)
rf_v5.train(predictors,target,training_frame=smpl_hex2)
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details ============= H2ORandomForestEstimator : Distributed Random Forest Model Key: rf_v5
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 300.0 | 300.0 | 631419.0 | 12.0 | 18.0 | 14.646667 | 153.0 | 173.0 | 163.04 |
ModelMetricsRegression: drf ** Reported on train data. ** MSE: 0.2108294692109558 RMSE: 0.45916170268322226 MAE: 0.4230355253292622 RMSLE: 0.32276222801680743 Mean Residual Deviance: 0.2108294692109558
ModelMetricsRegression: drf ** Reported on cross-validation data. ** MSE: 0.21091192465924377 RMSE: 0.4592514830234561 MAE: 0.4232599539533539 RMSLE: 0.3228381427219545 Mean Residual Deviance: 0.21091192465924377
| mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | cv_6_valid | cv_7_valid | cv_8_valid | cv_9_valid | cv_10_valid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mae | 0.4232679 | 0.0016838 | 0.4253459 | 0.4241945 | 0.4224509 | 0.4198744 | 0.4224276 | 0.4250855 | 0.4217607 | 0.4231897 | 0.4245106 | 0.4238390 |
| mean_residual_deviance | 0.2109185 | 0.0015538 | 0.2124475 | 0.2120962 | 0.2102032 | 0.2082453 | 0.2090172 | 0.2123263 | 0.2099796 | 0.2115506 | 0.2127480 | 0.2105706 |
| mse | 0.2109185 | 0.0015538 | 0.2124475 | 0.2120962 | 0.2102032 | 0.2082453 | 0.2090172 | 0.2123263 | 0.2099796 | 0.2115506 | 0.2127480 | 0.2105706 |
| r2 | 0.0507201 | 0.0073772 | 0.0476200 | 0.0475348 | 0.0598237 | 0.0438222 | 0.0542511 | 0.0418383 | 0.0644265 | 0.0539665 | 0.0436772 | 0.0502408 |
| residual_deviance | 0.2109185 | 0.0015538 | 0.2124475 | 0.2120962 | 0.2102032 | 0.2082453 | 0.2090172 | 0.2123263 | 0.2099796 | 0.2115506 | 0.2127480 | 0.2105706 |
| rmse | 0.4592558 | 0.0016929 | 0.4609203 | 0.4605391 | 0.4584792 | 0.4563390 | 0.4571839 | 0.4607888 | 0.4582353 | 0.4599463 | 0.4612462 | 0.4588798 |
| rmsle | 0.3228421 | 0.0011548 | 0.3240515 | 0.3234761 | 0.3216558 | 0.3223401 | 0.3222780 | 0.3242907 | 0.3207237 | 0.3225305 | 0.3240419 | 0.3230334 |
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2023-07-26 23:37:30 | 2 min 50.855 sec | 0.0 | nan | nan | nan | |
| 2023-07-26 23:37:30 | 2 min 50.982 sec | 1.0 | 0.4634545 | 0.4230118 | 0.2147901 | |
| 2023-07-26 23:37:30 | 2 min 51.102 sec | 2.0 | 0.4629596 | 0.4226218 | 0.2143316 | |
| 2023-07-26 23:37:30 | 2 min 51.180 sec | 3.0 | 0.4622660 | 0.4228333 | 0.2136898 | |
| 2023-07-26 23:37:31 | 2 min 51.239 sec | 4.0 | 0.4629738 | 0.4237793 | 0.2143448 | |
| 2023-07-26 23:37:31 | 2 min 51.311 sec | 5.0 | 0.4626122 | 0.4233074 | 0.2140101 | |
| 2023-07-26 23:37:31 | 2 min 51.379 sec | 6.0 | 0.4620551 | 0.4232934 | 0.2134949 | |
| 2023-07-26 23:37:31 | 2 min 51.446 sec | 7.0 | 0.4615999 | 0.4233006 | 0.2130745 | |
| 2023-07-26 23:37:31 | 2 min 51.498 sec | 8.0 | 0.4613335 | 0.4233212 | 0.2128286 | |
| 2023-07-26 23:37:31 | 2 min 51.552 sec | 9.0 | 0.4610778 | 0.4232437 | 0.2125928 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 23:37:34 | 2 min 54.613 sec | 58.0 | 0.4593532 | 0.4231081 | 0.2110054 | |
| 2023-07-26 23:37:34 | 2 min 54.675 sec | 59.0 | 0.4593352 | 0.4230896 | 0.2109888 | |
| 2023-07-26 23:37:34 | 2 min 54.725 sec | 60.0 | 0.4593225 | 0.4230810 | 0.2109772 | |
| 2023-07-26 23:37:34 | 2 min 54.768 sec | 61.0 | 0.4592950 | 0.4230651 | 0.2109519 | |
| 2023-07-26 23:37:34 | 2 min 54.811 sec | 62.0 | 0.4593009 | 0.4230671 | 0.2109573 | |
| 2023-07-26 23:37:34 | 2 min 54.855 sec | 63.0 | 0.4592915 | 0.4230444 | 0.2109487 | |
| 2023-07-26 23:37:38 | 2 min 58.891 sec | 155.0 | 0.4591898 | 0.4230735 | 0.2108553 | |
| 2023-07-26 23:37:42 | 3 min 2.910 sec | 246.0 | 0.4591784 | 0.4229916 | 0.2108448 | |
| 2023-07-26 23:37:46 | 3 min 6.914 sec | 299.0 | 0.4591607 | 0.4230368 | 0.2108286 | |
| 2023-07-26 23:37:46 | 3 min 6.993 sec | 300.0 | 0.4591617 | 0.4230355 | 0.2108295 |
[68 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 24170.1152344 | 1.0 | 0.2141307 |
| TD005_WOE | 19442.2890625 | 0.8043937 | 0.1722454 |
| TD014_bin_WOE | 11670.7021484 | 0.4828567 | 0.1033945 |
| AP003_bin_WOE | 10944.4785156 | 0.4528104 | 0.0969606 |
| CR015_bin_WOE | 8672.9453125 | 0.3588293 | 0.0768364 |
| PA022_bin_WOE | 5423.5327148 | 0.2243900 | 0.0480488 |
| AP008_WOE | 5042.3710938 | 0.2086201 | 0.0446720 |
| TD010_bin_WOE | 4650.3427734 | 0.1924005 | 0.0411989 |
| TD001_bin_WOE | 4553.0815430 | 0.1883765 | 0.0403372 |
| CR019_WOE | 4020.8261719 | 0.1663553 | 0.0356218 |
| PA029_bin_WOE | 3881.5087891 | 0.1605912 | 0.0343875 |
| PA023_bin_WOE | 3716.6398926 | 0.1537701 | 0.0329269 |
| AP001_WOE | 3558.7382812 | 0.1472371 | 0.0315280 |
| CR009_bin_WOE | 2015.2391357 | 0.0833773 | 0.0178536 |
| TD006_bin_WOE | 1112.7031250 | 0.0460363 | 0.0098578 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
ROC_AUC(rf_v5,smpl_hex2,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
ROC_AUC(rf_v5,test_hex,'loan_default')
drf prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
(1) If the feature set doesn't contain strong discriminatory information for the minority class, balancing the data alone might not lead to substantial improvements. (2) If the original dataset is well-balanced, representative, and contains sufficient information for the classifier to learn, then both oversampling and undersampling might not have a substantial impact on the model's performance.
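Either way, it is worth verifying that a resampled frame really has the intended class ratio before training. A toy sketch of the two strategies on hypothetical data (not the XYZ loan frame):

```python
import pandas as pd

# Toy imbalanced frame standing in for train_df_rf (hypothetical data)
df = pd.DataFrame({"x": range(10),
                   "loan_default": [1] + [0] * 9})

minority = df[df["loan_default"] == 1]
majority = df[df["loan_default"] == 0]

# Over-sampling: draw minority rows with replacement until the classes match
over = pd.concat([majority,
                  minority.sample(n=len(majority), replace=True, random_state=0)])

# Under-sampling: draw a majority subset the size of the minority class
under = pd.concat([minority,
                   majority.sample(n=len(minority), random_state=0)])

assert over["loan_default"].value_counts().to_dict() == {0: 9, 1: 9}
assert under["loan_default"].value_counts().to_dict() == {0: 1, 1: 1}
```

imblearn's RandomOverSampler does the same thing with a single fit_resample call, as used above.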
from sklearn.tree import DecisionTreeClassifier # for classification
from sklearn.tree import DecisionTreeRegressor # for regression
# First, specify the model
dtree = DecisionTreeClassifier(min_samples_leaf = 5, max_depth = 6)
# Then, train the model.
dtree.fit(train_df_WOE_withoutid,train_df.target)
DecisionTreeClassifier(max_depth=6, min_samples_leaf=5)
features = ['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
predictions = dtree.predict(test_df_WOE_withoutid[features])
predictions
array([0, 0, 0, ..., 0, 0, 0])
dtree.predict_proba(test_df_WOE_withoutid[features])
array([[0.85357873, 0.14642127],
[0.7785124 , 0.2214876 ],
[0.95964126, 0.04035874],
...,
[0.85357873, 0.14642127],
[0.86291827, 0.13708173],
[0.74327628, 0.25672372]])
y_pred = dtree.predict_proba(test_df_WOE_withoutid[features])[:,1]
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, confusion_matrix
roc_auc_value = roc_auc_score(test_df.target,y_pred)
roc_auc_value
0.598508109443518
fpr, tpr, _ = roc_curve(test_df.target,y_pred)
[fpr,tpr]
[array([0.00000000e+00, 5.44747082e-04, 1.08949416e-03, 1.16731518e-03,
9.49416342e-03, 1.06614786e-02, 1.15953307e-02, 1.27626459e-02,
2.87937743e-02, 3.19066148e-02, 4.38910506e-02, 4.69260700e-02,
4.87937743e-02, 6.07003891e-02, 9.06614786e-02, 9.53307393e-02,
1.57821012e-01, 1.94941634e-01, 2.06536965e-01, 2.27003891e-01,
2.34007782e-01, 2.39922179e-01, 2.60778210e-01, 3.08404669e-01,
3.28871595e-01, 3.41400778e-01, 3.46381323e-01, 3.83424125e-01,
3.83424125e-01, 4.29260700e-01, 4.49105058e-01, 4.51517510e-01,
4.55642023e-01, 4.64980545e-01, 4.66147860e-01, 4.81478599e-01,
5.48560311e-01, 5.50350195e-01, 5.53307393e-01, 5.59299611e-01,
5.98287938e-01, 6.05214008e-01, 6.63891051e-01, 6.78832685e-01,
6.83968872e-01, 7.88949416e-01, 7.96342412e-01, 8.03657588e-01,
8.29571984e-01, 8.39844358e-01, 8.50583658e-01, 8.57976654e-01,
9.50894942e-01, 9.52217899e-01, 9.58521401e-01, 9.81556420e-01,
9.85291829e-01, 9.91361868e-01, 9.94863813e-01, 9.96342412e-01,
9.99610895e-01, 1.00000000e+00]),
array([0.00000000e+00, 3.17460317e-04, 6.34920635e-04, 6.34920635e-04,
1.39682540e-02, 1.49206349e-02, 1.52380952e-02, 1.61904762e-02,
5.17460317e-02, 5.61904762e-02, 6.98412698e-02, 7.36507937e-02,
7.74603175e-02, 1.10158730e-01, 1.68571429e-01, 1.73015873e-01,
2.41904762e-01, 2.93015873e-01, 3.06666667e-01, 3.37460317e-01,
3.49206349e-01, 3.56190476e-01, 3.98730159e-01, 4.64761905e-01,
4.94920635e-01, 5.06349206e-01, 5.12063492e-01, 5.52380952e-01,
5.52698413e-01, 5.82222222e-01, 6.00317460e-01, 6.01269841e-01,
6.03809524e-01, 6.08253968e-01, 6.11428571e-01, 6.28888889e-01,
6.65079365e-01, 6.66349206e-01, 6.66984127e-01, 6.71111111e-01,
7.19047619e-01, 7.25714286e-01, 7.83809524e-01, 7.95238095e-01,
7.97142857e-01, 8.73650794e-01, 8.81269841e-01, 8.85396825e-01,
8.99047619e-01, 9.01269841e-01, 9.08888889e-01, 9.12063492e-01,
9.78095238e-01, 9.80000000e-01, 9.82222222e-01, 9.89841270e-01,
9.93015873e-01, 9.96825397e-01, 9.97777778e-01, 9.99047619e-01,
9.99682540e-01, 1.00000000e+00])]
import matplotlib.pyplot as plt
lw=2
plt.figure(figsize=(6,4))
plt.plot(fpr,tpr, color='darkorange',lw=lw,label='ROC curve (area = %0.2f)' %roc_auc_value)
plt.plot([0,1],[0,1], color='navy',lw=lw,linestyle='--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.show()
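A related summary that also appears in the gains tables below is the Kolmogorov–Smirnov (KS) statistic: the largest vertical gap between the ROC curve and the chance diagonal. It falls out of `roc_curve` directly (a minimal sketch):

```python
import numpy as np
from sklearn.metrics import roc_curve

def ks_statistic(y_true, y_score):
    """KS = max(TPR - FPR): the maximum separation between the
    cumulative distributions of positives and negatives."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))
```

A perfectly separating score yields KS = 1.0; a random score yields KS near 0.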
Gradient Boosting Machine (GBM) is a forward-learning ensemble method, widely used for both regression and classification. It combines many weak learners, typically shallow decision trees, in an iterative manner: each new tree corrects the errors made by the ensemble built so far, gradually improving predictive accuracy. GBM optimizes a loss function using gradient descent to find the best possible ensemble of trees.
H2O's GBM is an implementation of the GBM algorithm designed for high-performance and scalability. It offers various tuning parameters and options for model customization, making it a preferred choice for many data scientists and engineers dealing with big data scenarios.
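The iterative error-correction idea can be illustrated outside H2O with scikit-learn's `GradientBoostingClassifier` (a conceptual sketch on synthetic data, not the H2O API used below): `train_score_` records the training loss after each boosting stage and should shrink as trees are added.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=1)
gbm.fit(X_tr, y_tr)

# Each boosting stage lowers the training loss (the "forward learning" step)
assert gbm.train_score_[0] > gbm.train_score_[-1]
auc = roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```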
train_df_gbm = train_df_rf
test_df_gbm = test_df_rf
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_gbm.columns.tolist()
predictors=predictors[2:17]
predictors
['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
#Use 50% training data
train_smpl = train_df_gbm.sample(frac=0.5, random_state=1)
test_smpl = test_df_gbm.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm_v1 = H2OGradientBoostingEstimator(
model_id = 'gbm_v1',
seed=1234)
gbm_v1.train(predictors,target,training_frame=train_hex)
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details ============= H2OGradientBoostingEstimator : Gradient Boosting Machine Model Key: gbm_v1
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 50.0 | 50.0 | 22251.0 | 5.0 | 5.0 | 5.0 | 24.0 | 32.0 | 30.72 |
ModelMetricsRegression: gbm ** Reported on train data. ** MSE: 0.144712223309159 RMSE: 0.3804105983134001 MAE: 0.29368577951529734 RMSLE: 0.2663937423135674 Mean Residual Deviance: 0.144712223309159
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2023-07-26 01:00:03 | 0.350 sec | 0.0 | 0.3950172 | 0.3120772 | 0.1560386 | |
| 2023-07-26 01:00:04 | 1.243 sec | 1.0 | 0.3934904 | 0.3108099 | 0.1548347 | |
| 2023-07-26 01:00:05 | 1.560 sec | 2.0 | 0.3922290 | 0.3096628 | 0.1538436 | |
| 2023-07-26 01:00:05 | 1.754 sec | 3.0 | 0.3911679 | 0.3086323 | 0.1530123 | |
| 2023-07-26 01:00:05 | 1.940 sec | 4.0 | 0.3902702 | 0.3076790 | 0.1523109 | |
| 2023-07-26 01:00:05 | 2.092 sec | 5.0 | 0.3894680 | 0.3067850 | 0.1516854 | |
| 2023-07-26 01:00:05 | 2.256 sec | 6.0 | 0.3888119 | 0.3059794 | 0.1511747 | |
| 2023-07-26 01:00:05 | 2.405 sec | 7.0 | 0.3882096 | 0.3052193 | 0.1507067 | |
| 2023-07-26 01:00:06 | 2.573 sec | 8.0 | 0.3876805 | 0.3045070 | 0.1502961 | |
| 2023-07-26 01:00:06 | 2.711 sec | 9.0 | 0.3872399 | 0.3038831 | 0.1499547 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 01:00:07 | 3.526 sec | 15.0 | 0.3853087 | 0.3008516 | 0.1484628 | |
| 2023-07-26 01:00:07 | 3.653 sec | 16.0 | 0.3850870 | 0.3004736 | 0.1482920 | |
| 2023-07-26 01:00:07 | 3.760 sec | 17.0 | 0.3848541 | 0.3000817 | 0.1481127 | |
| 2023-07-26 01:00:07 | 3.864 sec | 18.0 | 0.3846232 | 0.2996997 | 0.1479350 | |
| 2023-07-26 01:00:07 | 3.994 sec | 19.0 | 0.3844344 | 0.2993804 | 0.1477898 | |
| 2023-07-26 01:00:07 | 4.101 sec | 20.0 | 0.3842524 | 0.2990728 | 0.1476499 | |
| 2023-07-26 01:00:07 | 4.209 sec | 21.0 | 0.3840773 | 0.2988061 | 0.1475154 | |
| 2023-07-26 01:00:07 | 4.287 sec | 22.0 | 0.3839034 | 0.2985029 | 0.1473818 | |
| 2023-07-26 01:00:07 | 4.375 sec | 23.0 | 0.3837294 | 0.2982259 | 0.1472483 | |
| 2023-07-26 01:00:10 | 7.187 sec | 50.0 | 0.3804106 | 0.2936858 | 0.1447122 |
[25 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 375.0860901 | 1.0 | 0.1963713 |
| TD005_WOE | 277.0020447 | 0.7385026 | 0.1450207 |
| AP003_bin_WOE | 199.6016083 | 0.5321488 | 0.1044987 |
| CR015_bin_WOE | 165.0644226 | 0.4400708 | 0.0864173 |
| AP001_WOE | 145.4165802 | 0.3876885 | 0.0761309 |
| CR019_WOE | 144.0874481 | 0.3841450 | 0.0754350 |
| TD014_bin_WOE | 116.8752594 | 0.3115958 | 0.0611885 |
| AP008_WOE | 104.1815643 | 0.2777537 | 0.0545429 |
| PA023_bin_WOE | 92.1280365 | 0.2456184 | 0.0482324 |
| PA029_bin_WOE | 73.7319412 | 0.1965734 | 0.0386014 |
| TD001_bin_WOE | 66.4085541 | 0.1770488 | 0.0347673 |
| CR009_bin_WOE | 52.7706642 | 0.1406895 | 0.0276274 |
| PA022_bin_WOE | 46.0381508 | 0.1227402 | 0.0241027 |
| TD010_bin_WOE | 36.5775566 | 0.0975178 | 0.0191497 |
| TD006_bin_WOE | 15.1164932 | 0.0403014 | 0.0079140 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
VarImp(gbm_v1)
createGains(gbm_v1)
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 1600 | 468 | 1132 | 1600 | 468 | 1132 | 0.15 | 0.09 | 315.0 | 1.49 | 6.0 | 29.25 |
| 1 | 1600 | 427 | 1173 | 3200 | 895 | 2305 | 0.28 | 0.18 | 630.0 | 1.42 | 10.0 | 27.97 |
| 2 | 1600 | 383 | 1217 | 4800 | 1278 | 3522 | 0.41 | 0.27 | 945.0 | 1.35 | 14.0 | 26.62 |
| 3 | 1600 | 361 | 1239 | 6400 | 1639 | 4761 | 0.52 | 0.37 | 1260.0 | 1.30 | 15.0 | 25.61 |
| 4 | 1600 | 315 | 1285 | 8000 | 1954 | 6046 | 0.62 | 0.47 | 1575.0 | 1.24 | 15.0 | 24.42 |
| 5 | 1600 | 259 | 1341 | 9600 | 2213 | 7387 | 0.70 | 0.57 | 1890.0 | 1.17 | 13.0 | 23.05 |
| 6 | 1600 | 241 | 1359 | 11200 | 2454 | 8746 | 0.78 | 0.68 | 2205.0 | 1.11 | 10.0 | 21.91 |
| 7 | 1600 | 257 | 1343 | 12800 | 2711 | 10089 | 0.86 | 0.79 | 2520.0 | 1.08 | 7.0 | 21.18 |
| 8 | 1600 | 251 | 1349 | 14400 | 2962 | 11438 | 0.94 | 0.89 | 2835.0 | 1.04 | 5.0 | 20.57 |
| 9 | 1600 | 188 | 1412 | 16000 | 3150 | 12850 | 1.00 | 1.00 | 3150.0 | 1.00 | 0.0 | 19.69 |
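`createGains` is a course-provided helper; its core decile logic can be sketched in pandas (function and column names here are illustrative, not the helper's actual implementation): sort by predicted score, cut into deciles, then compare each decile's cumulative capture of actual positives against the cumulative share of the population.

```python
import numpy as np
import pandas as pd

def gains_table(y_true, y_score, n_bins=10):
    """Cumulative gains/lift by score decile (highest scores first)."""
    df = pd.DataFrame({"actual": y_true, "score": y_score})
    df = df.sort_values("score", ascending=False).reset_index(drop=True)
    df["decile"] = pd.qcut(df.index, n_bins, labels=False)
    g = df.groupby("decile")["actual"].agg(count="size", actual="sum")
    g["cum_actual"] = g["actual"].cumsum()
    g["pct_cum_actual"] = g["cum_actual"] / g["actual"].sum()
    g["pct_cum_count"] = g["count"].cumsum() / g["count"].sum()
    # Lift = share of positives captured vs. share of population scored
    g["lift"] = g["pct_cum_actual"] / g["pct_cum_count"]
    return g
```

By construction the bottom row's lift is exactly 1.0, matching the tables above.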
ROC_AUC(gbm_v1,test_hex,'loan_default')
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_gbm.columns.tolist()
predictors=predictors[2:17]
values_to_remove = ['TD006_bin_WOE', 'TD010_bin_WOE']
predictors = [item for item in predictors if item not in values_to_remove]
predictors
['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD009_bin_WOE', 'TD014_bin_WOE']
gbm_v2 = H2OGradientBoostingEstimator(
    model_id = 'gbm_v2',  # distinct id so the earlier gbm_v1 model is not overwritten
    seed=1234)
gbm_v2.train(predictors,target,training_frame=train_hex)
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details ============= H2OGradientBoostingEstimator : Gradient Boosting Machine Model Key: gbm_v1
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 50.0 | 50.0 | 22361.0 | 5.0 | 5.0 | 5.0 | 26.0 | 32.0 | 30.9 |
ModelMetricsRegression: gbm ** Reported on train data. ** MSE: 0.14693649700108505 RMSE: 0.3833229669626972 MAE: 0.29628467947138415 RMSLE: 0.2686390871878489 Mean Residual Deviance: 0.14693649700108505
| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | |
|---|---|---|---|---|---|---|
| 2023-07-26 01:29:24 | 0.019 sec | 0.0 | 0.3944827 | 0.3112333 | 0.1556166 | |
| 2023-07-26 01:29:25 | 0.160 sec | 1.0 | 0.3931736 | 0.3101478 | 0.1545855 | |
| 2023-07-26 01:29:25 | 0.268 sec | 2.0 | 0.3921000 | 0.3091753 | 0.1537424 | |
| 2023-07-26 01:29:25 | 0.388 sec | 3.0 | 0.3912023 | 0.3082834 | 0.1530392 | |
| 2023-07-26 01:29:25 | 0.528 sec | 4.0 | 0.3904423 | 0.3074741 | 0.1524452 | |
| 2023-07-26 01:29:25 | 0.711 sec | 5.0 | 0.3897861 | 0.3067226 | 0.1519332 | |
| 2023-07-26 01:29:25 | 0.832 sec | 6.0 | 0.3892457 | 0.3060523 | 0.1515122 | |
| 2023-07-26 01:29:25 | 0.945 sec | 7.0 | 0.3887421 | 0.3054030 | 0.1511204 | |
| 2023-07-26 01:29:26 | 1.068 sec | 8.0 | 0.3883223 | 0.3048271 | 0.1507942 | |
| 2023-07-26 01:29:26 | 1.183 sec | 9.0 | 0.3879452 | 0.3042796 | 0.1505015 | |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 01:29:27 | 2.914 sec | 24.0 | 0.3851501 | 0.2994155 | 0.1483406 | |
| 2023-07-26 01:29:27 | 3.029 sec | 25.0 | 0.3850391 | 0.2992090 | 0.1482551 | |
| 2023-07-26 01:29:28 | 3.159 sec | 26.0 | 0.3849357 | 0.2990127 | 0.1481755 | |
| 2023-07-26 01:29:28 | 3.282 sec | 27.0 | 0.3848465 | 0.2988435 | 0.1481068 | |
| 2023-07-26 01:29:28 | 3.471 sec | 28.0 | 0.3847656 | 0.2986781 | 0.1480446 | |
| 2023-07-26 01:29:28 | 3.610 sec | 29.0 | 0.3846776 | 0.2985261 | 0.1479769 | |
| 2023-07-26 01:29:28 | 3.720 sec | 30.0 | 0.3846082 | 0.2983791 | 0.1479235 | |
| 2023-07-26 01:29:28 | 3.837 sec | 31.0 | 0.3845231 | 0.2982192 | 0.1478580 | |
| 2023-07-26 01:29:28 | 3.952 sec | 32.0 | 0.3844426 | 0.2980950 | 0.1477961 | |
| 2023-07-26 01:29:32 | 7.151 sec | 50.0 | 0.3833230 | 0.2962847 | 0.1469365 |
[34 rows x 7 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 746.8563232 | 1.0 | 0.2552778 |
| TD005_WOE | 400.7232666 | 0.5365467 | 0.1369684 |
| AP003_bin_WOE | 339.1337891 | 0.4540817 | 0.1159170 |
| CR015_bin_WOE | 263.8482056 | 0.3532784 | 0.0901841 |
| TD014_bin_WOE | 193.5591583 | 0.2591652 | 0.0661591 |
| AP001_WOE | 167.8301697 | 0.2247155 | 0.0573649 |
| AP008_WOE | 160.4264374 | 0.2148023 | 0.0548342 |
| PA029_bin_WOE | 153.1158142 | 0.2050137 | 0.0523355 |
| PA023_bin_WOE | 133.1398926 | 0.1782671 | 0.0455076 |
| CR019_WOE | 130.6924133 | 0.1749900 | 0.0446711 |
| TD001_bin_WOE | 79.9485245 | 0.1070467 | 0.0273267 |
| CR009_bin_WOE | 78.9457855 | 0.1057041 | 0.0269839 |
| PA022_bin_WOE | 77.4414291 | 0.1036899 | 0.0264697 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
createGains(gbm_v2)
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 1600 | 482 | 1118 | 1600 | 482 | 1118 | 0.15 | 0.09 | 315.0 | 1.53 | 6.0 | 30.12 |
| 1 | 1600 | 418 | 1182 | 3200 | 900 | 2300 | 0.29 | 0.18 | 630.0 | 1.43 | 11.0 | 28.12 |
| 2 | 1600 | 364 | 1236 | 4800 | 1264 | 3536 | 0.40 | 0.28 | 945.0 | 1.34 | 12.0 | 26.33 |
| 3 | 1600 | 360 | 1240 | 6400 | 1624 | 4776 | 0.52 | 0.37 | 1260.0 | 1.29 | 15.0 | 25.37 |
| 4 | 1600 | 305 | 1295 | 8000 | 1929 | 6071 | 0.61 | 0.47 | 1575.0 | 1.22 | 14.0 | 24.11 |
| 5 | 1600 | 265 | 1335 | 9600 | 2194 | 7406 | 0.70 | 0.58 | 1890.0 | 1.16 | 12.0 | 22.85 |
| 6 | 1600 | 256 | 1344 | 11200 | 2450 | 8750 | 0.78 | 0.68 | 2205.0 | 1.11 | 10.0 | 21.88 |
| 7 | 1600 | 269 | 1331 | 12800 | 2719 | 10081 | 0.86 | 0.78 | 2520.0 | 1.08 | 8.0 | 21.24 |
| 8 | 1600 | 243 | 1357 | 14400 | 2962 | 11438 | 0.94 | 0.89 | 2835.0 | 1.04 | 5.0 | 20.57 |
| 9 | 1600 | 188 | 1412 | 16000 | 3150 | 12850 | 1.00 | 1.00 | 3150.0 | 1.00 | 0.0 | 19.69 |
ROC_AUC(gbm_v2,test_hex,'loan_default')
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
Deep Learning involves the use of artificial neural networks composed of multiple layers of neurons. Each layer processes the input from the previous layer and gradually learns to extract higher-level representations of the data. Deep learning is particularly well-suited for complex tasks like image recognition, natural language processing, and speech recognition, where traditional machine learning techniques may struggle.
H2O's Deep Learning is based on a multi-layer feedforward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier, and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing, and grid search enable high predictive accuracy.
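As a rough scikit-learn analogue of the single-hidden-layer networks trained below (illustrative only, on synthetic data; H2O's estimator is what the assignment uses), a 15-unit tanh hidden layer mirrors the `hidden=[15]`, `activation="Tanh"` configuration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=15, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# One hidden layer of 15 tanh units, trained with SGD-style backprop
mlp = MLPClassifier(hidden_layer_sizes=(15,), activation="tanh",
                    max_iter=500, random_state=1)
mlp.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, mlp.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```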
!pip install h2o
import h2o
from h2o.estimators import H2ODeepLearningEstimator
h2o.init()
Collecting h2o
Downloading h2o-3.42.0.2.tar.gz (249.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 249.1/249.1 MB 5.0 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from h2o) (2.27.1)
Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from h2o) (0.9.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (1.26.16)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2023.7.22)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2.0.12)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.4)
Building wheels for collected packages: h2o
Building wheel for h2o (setup.py) ... done
Created wheel for h2o: filename=h2o-3.42.0.2-py2.py3-none-any.whl size=249153908 sha256=c9674c27a88bbe137b165755d325702177458c3c652f353a06c2f8855e00e358
Stored in directory: /root/.cache/pip/wheels/31/f7/e0/e32942d9f76cb1cb14c949b7772eb78979d2e0132aae6c6780
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.42.0.2
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "11.0.19" 2023-04-18; OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1); OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
Starting server from /usr/local/lib/python3.10/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpieh9mrkw
JVM stdout: /tmp/tmpieh9mrkw/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpieh9mrkw/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
| H2O_cluster_uptime: | 03 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.42.0.2 |
| H2O_cluster_version_age: | 1 day |
| H2O_cluster_name: | H2O_from_python_unknownUser_k99jd3 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.170 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://127.0.0.1:54321 |
| H2O_connection_proxy: | {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"} |
| H2O_internal_security: | False |
| Python_version: | 3.10.6 final |
train_df_dl = train_df_rf
test_df_dl = test_df_rf
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_dl.columns.tolist()
predictors=predictors[2:17]
predictors
['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
#Use 50% training data
train_smpl = train_df_dl.sample(frac=0.5, random_state=1)
test_smpl = test_df_dl.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
# Build and train the model:
dl_v1 = H2ODeepLearningEstimator(distribution="tweedie",
hidden=[1],
epochs=1000,
train_samples_per_iteration=-1,
reproducible=True,
activation="Tanh",
single_node_mode=False,
balance_classes=False,
force_load_balance=False,
seed=23123,
tweedie_power=1.5,
score_training_samples=0,
score_validation_samples=0,
stopping_rounds=0)
dl_v1.train(x=predictors,
y=target,
training_frame=train_hex)
deeplearning Model Build progress: |█████████████████████████████████████████████| (done) 100%
Model Details ============= H2ODeepLearningEstimator : Deep Learning Model Key: DeepLearning_model_python_1690391326174_1
| layer | units | type | dropout | l1 | l2 | mean_rate | rate_rms | momentum | mean_weight | weight_rms | mean_bias | bias_rms | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15 | Input | 0.0 | ||||||||||
| 2 | 1 | Tanh | 0.0 | 0.0 | 0.0 | 0.0004968 | 0.0001334 | 0.0 | 0.1476793 | 0.1542717 | -0.0440504 | 0.0000000 | |
| 3 | 1 | Linear | 0.0 | 0.0 | 0.0004108 | 0.0000000 | 0.0 | 0.6655666 | 0.0000000 | -1.7210161 | 0.0000000 |
ModelMetricsRegression: deeplearning ** Reported on train data. ** MSE: 0.14966607913911537 RMSE: 0.3868670044590458 MAE: 0.2986335369918558 RMSLE: 0.27133556429679573 Mean Residual Deviance: 1.8961575096973406
| timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_deviance | training_mae | training_r2 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-07-26 17:46:55 | 0.000 sec | None | 0.0 | 0 | 0.0 | nan | nan | nan | nan | |
| 2023-07-26 17:46:55 | 1.218 sec | 72562 obs/sec | 1.0 | 1 | 32000.0 | 0.3889035 | 1.9117893 | 0.2986558 | 0.0307148 | |
| 2023-07-26 17:46:56 | 1.646 sec | 92485 obs/sec | 2.0 | 2 | 64000.0 | 0.3879778 | 1.9027225 | 0.3033854 | 0.0353233 | |
| 2023-07-26 17:46:56 | 1.923 sec | 121518 obs/sec | 3.0 | 3 | 96000.0 | 0.3872472 | 1.8996910 | 0.2998927 | 0.0389530 | |
| 2023-07-26 17:46:57 | 2.358 sec | 128256 obs/sec | 4.0 | 4 | 128000.0 | 0.3870386 | 1.8971095 | 0.3009621 | 0.0399884 | |
| 2023-07-26 17:46:57 | 2.843 sec | 127490 obs/sec | 5.0 | 5 | 160000.0 | 0.3872056 | 1.8989280 | 0.2953510 | 0.0391596 | |
| 2023-07-26 17:46:58 | 3.372 sec | 125162 obs/sec | 6.0 | 6 | 192000.0 | 0.3868670 | 1.8961575 | 0.2986335 | 0.0408393 | |
| 2023-07-26 17:46:58 | 3.821 sec | 122270 obs/sec | 7.0 | 7 | 224000.0 | 0.3870500 | 1.8974509 | 0.2988586 | 0.0399319 | |
| 2023-07-26 17:46:58 | 4.219 sec | 126046 obs/sec | 8.0 | 8 | 256000.0 | 0.3872289 | 1.8987255 | 0.2944169 | 0.0390442 | |
| 2023-07-26 17:46:59 | 4.632 sec | 127886 obs/sec | 9.0 | 9 | 288000.0 | 0.3871469 | 1.8980029 | 0.3008467 | 0.0394510 | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 17:48:48 | 1 min 53.331 sec | 498007 obs/sec | 992.0 | 992 | 31744000.0000000 | 0.3888569 | 1.9172578 | 0.2952023 | 0.0309469 | |
| 2023-07-26 17:48:48 | 1 min 53.423 sec | 498056 obs/sec | 993.0 | 993 | 31776000.0000000 | 0.3887163 | 1.9155428 | 0.2993813 | 0.0316473 | |
| 2023-07-26 17:48:48 | 1 min 53.516 sec | 498159 obs/sec | 994.0 | 994 | 31808000.0000000 | 0.3886963 | 1.9153043 | 0.3025350 | 0.0317471 | |
| 2023-07-26 17:48:48 | 1 min 53.609 sec | 498247 obs/sec | 995.0 | 995 | 31840000.0000000 | 0.3887182 | 1.9155005 | 0.2999262 | 0.0316382 | |
| 2023-07-26 17:48:48 | 1 min 53.694 sec | 498358 obs/sec | 996.0 | 996 | 31872000.0000000 | 0.3887248 | 1.9154189 | 0.3029753 | 0.0316053 | |
| 2023-07-26 17:48:48 | 1 min 53.781 sec | 498453 obs/sec | 997.0 | 997 | 31904000.0000000 | 0.3886923 | 1.9152971 | 0.3027701 | 0.0317669 | |
| 2023-07-26 17:48:48 | 1 min 53.869 sec | 498571 obs/sec | 998.0 | 998 | 31936000.0000000 | 0.3887642 | 1.9158048 | 0.2997823 | 0.0314090 | |
| 2023-07-26 17:48:48 | 1 min 53.960 sec | 498635 obs/sec | 999.0 | 999 | 31968000.0000000 | 0.3887361 | 1.9159206 | 0.2986089 | 0.0315489 | |
| 2023-07-26 17:48:48 | 1 min 54.047 sec | 498745 obs/sec | 1000.0 | 1000 | 32000000.0000000 | 0.3887176 | 1.9155962 | 0.3005151 | 0.0316409 | |
| 2023-07-26 17:48:48 | 1 min 54.104 sec | 498597 obs/sec | 1000.0 | 1000 | 32000000.0000000 | 0.3868670 | 1.8961575 | 0.2986335 | 0.0408393 |
[1002 rows x 11 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| AP003_bin_WOE | 1.0 | 1.0 | 0.2347512 |
| CR015_bin_WOE | 0.6420705 | 0.6420705 | 0.1507269 |
| TD009_bin_WOE | 0.5127221 | 0.5127221 | 0.1203622 |
| TD005_WOE | 0.4553512 | 0.4553512 | 0.1068943 |
| TD014_bin_WOE | 0.4146544 | 0.4146544 | 0.0973406 |
| AP008_WOE | 0.2762516 | 0.2762516 | 0.0648504 |
| PA023_bin_WOE | 0.2164090 | 0.2164090 | 0.0508023 |
| PA022_bin_WOE | 0.1987160 | 0.1987160 | 0.0466488 |
| TD001_bin_WOE | 0.1870835 | 0.1870835 | 0.0439181 |
| PA029_bin_WOE | 0.1362096 | 0.1362096 | 0.0319754 |
| TD006_bin_WOE | 0.0837359 | 0.0837359 | 0.0196571 |
| CR019_WOE | 0.0550442 | 0.0550442 | 0.0129217 |
| CR009_bin_WOE | 0.0322155 | 0.0322155 | 0.0075626 |
| TD010_bin_WOE | 0.0253704 | 0.0253704 | 0.0059557 |
| AP001_WOE | 0.0239944 | 0.0239944 | 0.0056327 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
VarImp(dl_v1)
createGains(dl_v1)
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 800 | 228 | 572 | 800 | 228 | 572 | 0.15 | 0.09 | 151.2 | 1.51 | 6.0 | 28.50 |
| 1 | 800 | 204 | 596 | 1600 | 432 | 1168 | 0.29 | 0.18 | 302.4 | 1.43 | 11.0 | 27.00 |
| 2 | 800 | 193 | 607 | 2400 | 625 | 1775 | 0.41 | 0.27 | 453.6 | 1.38 | 14.0 | 26.04 |
| 3 | 800 | 178 | 622 | 3200 | 803 | 2397 | 0.53 | 0.37 | 604.8 | 1.33 | 16.0 | 25.09 |
| 4 | 800 | 145 | 655 | 4000 | 948 | 3052 | 0.63 | 0.47 | 756.0 | 1.25 | 16.0 | 23.70 |
| 5 | 800 | 133 | 667 | 4800 | 1081 | 3719 | 0.71 | 0.57 | 907.2 | 1.19 | 14.0 | 22.52 |
| 6 | 800 | 113 | 687 | 5600 | 1194 | 4406 | 0.79 | 0.68 | 1058.4 | 1.13 | 11.0 | 21.32 |
| 7 | 800 | 124 | 676 | 6400 | 1318 | 5082 | 0.87 | 0.78 | 1209.6 | 1.09 | 9.0 | 20.59 |
| 8 | 800 | 113 | 687 | 7200 | 1431 | 5769 | 0.95 | 0.89 | 1360.8 | 1.05 | 6.0 | 19.88 |
| 9 | 800 | 81 | 719 | 8000 | 1512 | 6488 | 1.00 | 1.00 | 1512.0 | 1.00 | 0.0 | 18.90 |
ROC_AUC(dl_v1,test_hex,'loan_default')
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
#Use 50% training data
train_smpl = train_df_dl.sample(frac=0.5, random_state=1)
test_smpl = test_df_dl.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
# Build and train the model:
dl_v2 = H2ODeepLearningEstimator(distribution="tweedie",
hidden=[15],
epochs=1000,
train_samples_per_iteration=-1,
reproducible=True,
activation="Tanh",
single_node_mode=False,
balance_classes=False,
force_load_balance=False,
seed=23123,
tweedie_power=1.5,
score_training_samples=0,
score_validation_samples=0,
stopping_rounds=0)
dl_v2.train(x=predictors,
y=target,
training_frame=train_hex)
deeplearning Model Build progress: |█████████████████████████████████████████████| (done) 100%
Model Details ============= H2ODeepLearningEstimator : Deep Learning Model Key: DeepLearning_model_python_1690391326174_7
| layer | units | type | dropout | l1 | l2 | mean_rate | rate_rms | momentum | mean_weight | weight_rms | mean_bias | bias_rms | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15 | Input | 0.0 | ||||||||||
| 2 | 15 | Tanh | 0.0 | 0.0 | 0.0 | 0.0578403 | 0.1227143 | 0.0 | -0.0593437 | 0.4779193 | -0.5849661 | 1.6329694 | |
| 3 | 1 | Linear | 0.0 | 0.0 | 0.0004323 | 0.0001015 | 0.0 | 0.0434363 | 0.1713431 | -0.5021105 | 0.0000000 |
ModelMetricsRegression: deeplearning ** Reported on train data. ** MSE: 0.14942329051863198 RMSE: 0.38655308887477796 MAE: 0.3020197073141356 RMSLE: 0.27189718953532904 Mean Residual Deviance: 1.8937793450713964
| timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_deviance | training_mae | training_r2 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-07-26 18:53:52 | 0.000 sec | None | 0.0 | 0 | 0.0 | nan | nan | nan | nan | |
| 2023-07-26 18:53:54 | 1.970 sec | 19138 obs/sec | 1.0 | 1 | 32000.0 | 0.3881327 | 1.9006293 | 0.2980732 | 0.0345530 | |
| 2023-07-26 18:53:55 | 3.196 sec | 24233 obs/sec | 2.0 | 2 | 64000.0 | 0.3873059 | 1.8969044 | 0.2986840 | 0.0386616 | |
| 2023-07-26 18:53:56 | 4.086 sec | 28012 obs/sec | 3.0 | 3 | 96000.0 | 0.3870877 | 1.8969663 | 0.3064743 | 0.0397447 | |
| 2023-07-26 18:53:57 | 4.605 sec | 32947 obs/sec | 4.0 | 4 | 128000.0 | 0.3867414 | 1.8941267 | 0.2975281 | 0.0414620 | |
| 2023-07-26 18:53:57 | 5.045 sec | 37488 obs/sec | 5.0 | 5 | 160000.0 | 0.3872732 | 1.8983977 | 0.2946867 | 0.0388241 | |
| 2023-07-26 18:53:58 | 5.475 sec | 41406 obs/sec | 6.0 | 6 | 192000.0 | 0.3868850 | 1.8966583 | 0.2902008 | 0.0407501 | |
| 2023-07-26 18:53:58 | 5.901 sec | 44728 obs/sec | 7.0 | 7 | 224000.0 | 0.3871962 | 1.8970791 | 0.2995006 | 0.0392064 | |
| 2023-07-26 18:53:58 | 6.330 sec | 47583 obs/sec | 8.0 | 8 | 256000.0 | 0.3867782 | 1.8951614 | 0.2967038 | 0.0412796 | |
| 2023-07-26 18:53:59 | 6.753 sec | 50130 obs/sec | 9.0 | 9 | 288000.0 | 0.3868773 | 1.8952747 | 0.2960012 | 0.0407884 | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 19:00:37 | 6 min 45.268 sec | 93817 obs/sec | 992.0 | 992 | 31744000.0000000 | 0.3870540 | 1.8995680 | 0.2935097 | 0.0399120 | |
| 2023-07-26 19:00:38 | 6 min 45.611 sec | 93832 obs/sec | 993.0 | 993 | 31776000.0000000 | 0.3874609 | 1.8991271 | 0.2975654 | 0.0378924 | |
| 2023-07-26 19:00:38 | 6 min 45.956 sec | 93846 obs/sec | 994.0 | 994 | 31808000.0000000 | 0.3870855 | 1.8998078 | 0.2924485 | 0.0397554 | |
| 2023-07-26 19:00:38 | 6 min 46.309 sec | 93858 obs/sec | 995.0 | 995 | 31840000.0000000 | 0.3874081 | 1.9044613 | 0.3107757 | 0.0381545 | |
| 2023-07-26 19:00:39 | 6 min 46.657 sec | 93873 obs/sec | 996.0 | 996 | 31872000.0000000 | 0.3870366 | 1.9013690 | 0.3041471 | 0.0399982 | |
| 2023-07-26 19:00:39 | 6 min 47.002 sec | 93888 obs/sec | 997.0 | 997 | 31904000.0000000 | 0.3872504 | 1.9025377 | 0.3066765 | 0.0389372 | |
| 2023-07-26 19:00:39 | 6 min 47.354 sec | 93901 obs/sec | 998.0 | 998 | 31936000.0000000 | 0.3880618 | 1.9069717 | 0.3143574 | 0.0349056 | |
| 2023-07-26 19:00:40 | 6 min 47.711 sec | 93915 obs/sec | 999.0 | 999 | 31968000.0000000 | 0.3872698 | 1.8996847 | 0.3028403 | 0.0388408 | |
| 2023-07-26 19:00:40 | 6 min 48.056 sec | 93929 obs/sec | 1000.0 | 1000 | 32000000.0000000 | 0.3879566 | 1.9117697 | 0.2843278 | 0.0354289 | |
| 2023-07-26 19:00:40 | 6 min 48.130 sec | 93924 obs/sec | 1000.0 | 1000 | 32000000.0000000 | 0.3865531 | 1.8937793 | 0.3020197 | 0.0423953 |
[1002 rows x 11 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| AP003_bin_WOE | 1.0 | 1.0 | 0.1622127 |
| TD014_bin_WOE | 0.6524100 | 0.6524100 | 0.1058292 |
| CR015_bin_WOE | 0.5777300 | 0.5777300 | 0.0937151 |
| TD006_bin_WOE | 0.5119238 | 0.5119238 | 0.0830405 |
| TD009_bin_WOE | 0.4652037 | 0.4652037 | 0.0754619 |
| TD010_bin_WOE | 0.4546128 | 0.4546128 | 0.0737439 |
| TD005_WOE | 0.3899076 | 0.3899076 | 0.0632480 |
| CR019_WOE | 0.3342157 | 0.3342157 | 0.0542140 |
| PA029_bin_WOE | 0.3297463 | 0.3297463 | 0.0534890 |
| CR009_bin_WOE | 0.3015221 | 0.3015221 | 0.0489107 |
| TD001_bin_WOE | 0.2873588 | 0.2873588 | 0.0466132 |
| PA022_bin_WOE | 0.2786159 | 0.2786159 | 0.0451950 |
| PA023_bin_WOE | 0.2548023 | 0.2548023 | 0.0413322 |
| AP008_WOE | 0.1798768 | 0.1798768 | 0.0291783 |
| AP001_WOE | 0.1468209 | 0.1468209 | 0.0238162 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
ROC_AUC(dl_v2,test_hex,'loan_default')
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_dl.columns.tolist()
predictors=predictors[2:17]
values_to_remove = ['AP001_WOE', 'TD010_bin_WOE','CR009_bin_WOE']
predictors = [item for item in predictors if item not in values_to_remove]
predictors
['AP003_bin_WOE', 'AP008_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD014_bin_WOE']
#Use 50% training data and all test data
train_smpl = train_df_dl.sample(frac=0.5, random_state=1)
test_smpl = test_df_dl.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
# Build and train the model:
dl_v3 = H2ODeepLearningEstimator(distribution="tweedie",
hidden=[15],
epochs=1000,
train_samples_per_iteration=-1,
reproducible=True,
activation="Tanh",
single_node_mode=False,
balance_classes=False,
force_load_balance=False,
seed=23123,
tweedie_power=1.5,
score_training_samples=0,
score_validation_samples=0,
stopping_rounds=0)
dl_v3.train(x=predictors,
y=target,
training_frame=train_hex)
deeplearning Model Build progress: |█████████████████████████████████████████████| (done) 100%
Model Details ============= H2ODeepLearningEstimator : Deep Learning Model Key: DeepLearning_model_python_1690391326174_8
| layer | units | type | dropout | l1 | l2 | mean_rate | rate_rms | momentum | mean_weight | weight_rms | mean_bias | bias_rms | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 12 | Input | 0.0 | ||||||||||
| 2 | 15 | Tanh | 0.0 | 0.0 | 0.0 | 0.0351888 | 0.0610454 | 0.0 | -0.0618614 | 0.3207351 | 0.1599673 | 1.3066607 | |
| 3 | 1 | Linear | 0.0 | 0.0 | 0.0004093 | 0.0000196 | 0.0 | -0.0319876 | 0.1622897 | -0.8569771 | 0.0000000 |
ModelMetricsRegression: deeplearning ** Reported on train data. ** MSE: 0.1495221752667646 RMSE: 0.38668097349981495 MAE: 0.3011394386382141 RMSLE: 0.27164728953221895 Mean Residual Deviance: 1.8940746121314866
| timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_deviance | training_mae | training_r2 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2023-07-26 19:04:44 | 0.000 sec | None | 0.0 | 0 | 0.0 | nan | nan | nan | nan | |
| 2023-07-26 19:04:45 | 1.971 sec | 19161 obs/sec | 1.0 | 1 | 32000.0 | 0.3879453 | 1.8996317 | 0.2972218 | 0.0354848 | |
| 2023-07-26 19:04:47 | 3.345 sec | 23503 obs/sec | 2.0 | 2 | 64000.0 | 0.3876217 | 1.8976999 | 0.2994473 | 0.0370933 | |
| 2023-07-26 19:04:48 | 4.212 sec | 28152 obs/sec | 3.0 | 3 | 96000.0 | 0.3871285 | 1.8966283 | 0.3075731 | 0.0395422 | |
| 2023-07-26 19:04:49 | 5.255 sec | 29540 obs/sec | 4.0 | 4 | 128000.0 | 0.3868933 | 1.8949465 | 0.2987211 | 0.0407091 | |
| 2023-07-26 19:04:49 | 5.999 sec | 32566 obs/sec | 5.0 | 5 | 160000.0 | 0.3870850 | 1.8962858 | 0.2952595 | 0.0397579 | |
| 2023-07-26 19:04:50 | 6.946 sec | 33790 obs/sec | 6.0 | 6 | 192000.0 | 0.3869928 | 1.8979457 | 0.2895132 | 0.0402157 | |
| 2023-07-26 19:04:52 | 8.231 sec | 32719 obs/sec | 7.0 | 7 | 224000.0 | 0.3872329 | 1.8974479 | 0.2994648 | 0.0390243 | |
| 2023-07-26 19:04:53 | 9.759 sec | 31569 obs/sec | 8.0 | 8 | 256000.0 | 0.3868305 | 1.8955196 | 0.2969099 | 0.0410204 | |
| 2023-07-26 19:04:54 | 11.046 sec | 31355 obs/sec | 9.0 | 9 | 288000.0 | 0.3871136 | 1.8969040 | 0.2979204 | 0.0396161 | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-07-26 19:10:56 | 6 min 12.565 sec | 102812 obs/sec | 992.0 | 992 | 31744000.0000000 | 0.3871540 | 1.9004186 | 0.2946634 | 0.0394155 | |
| 2023-07-26 19:10:56 | 6 min 12.875 sec | 102829 obs/sec | 993.0 | 993 | 31776000.0000000 | 0.3884846 | 1.9017094 | 0.2992910 | 0.0328013 | |
| 2023-07-26 19:10:57 | 6 min 13.189 sec | 102845 obs/sec | 994.0 | 994 | 31808000.0000000 | 0.3873462 | 1.9022251 | 0.2928724 | 0.0384615 | |
| 2023-07-26 19:10:57 | 6 min 13.514 sec | 102858 obs/sec | 995.0 | 995 | 31840000.0000000 | 0.3877552 | 1.9052790 | 0.3105979 | 0.0364300 | |
| 2023-07-26 19:10:57 | 6 min 13.829 sec | 102875 obs/sec | 996.0 | 996 | 31872000.0000000 | 0.3871145 | 1.9019268 | 0.3053463 | 0.0396115 | |
| 2023-07-26 19:10:58 | 6 min 14.147 sec | 102890 obs/sec | 997.0 | 997 | 31904000.0000000 | 0.3874727 | 1.9045983 | 0.3077272 | 0.0378336 | |
| 2023-07-26 19:10:58 | 6 min 14.462 sec | 102906 obs/sec | 998.0 | 998 | 31936000.0000000 | 0.3881551 | 1.9090649 | 0.3144513 | 0.0344417 | |
| 2023-07-26 19:10:58 | 6 min 14.878 sec | 102905 obs/sec | 999.0 | 999 | 31968000.0000000 | 0.3875725 | 1.9004903 | 0.3021549 | 0.0373376 | |
| 2023-07-26 19:10:59 | 6 min 15.360 sec | 102884 obs/sec | 1000.0 | 1000 | 32000000.0000000 | 0.3882399 | 1.9135242 | 0.2852468 | 0.0340197 | |
| 2023-07-26 19:10:59 | 6 min 15.496 sec | 102873 obs/sec | 1000.0 | 1000 | 32000000.0000000 | 0.3866810 | 1.8940746 | 0.3011394 | 0.0417616 |
[1002 rows x 11 columns]
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| AP003_bin_WOE | 1.0 | 1.0 | 0.1875495 |
| TD005_WOE | 0.7480226 | 0.7480226 | 0.1402912 |
| CR015_bin_WOE | 0.5754523 | 0.5754523 | 0.1079258 |
| TD009_bin_WOE | 0.4181940 | 0.4181940 | 0.0784321 |
| TD014_bin_WOE | 0.4101961 | 0.4101961 | 0.0769321 |
| PA029_bin_WOE | 0.3894845 | 0.3894845 | 0.0730476 |
| CR019_WOE | 0.3676064 | 0.3676064 | 0.0689444 |
| PA023_bin_WOE | 0.3661864 | 0.3661864 | 0.0686781 |
| TD001_bin_WOE | 0.3494761 | 0.3494761 | 0.0655441 |
| PA022_bin_WOE | 0.2866926 | 0.2866926 | 0.0537690 |
| TD006_bin_WOE | 0.2157678 | 0.2157678 | 0.0404671 |
| AP008_WOE | 0.2048479 | 0.2048479 | 0.0384191 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
ROC_AUC(dl_v3,test_hex,'loan_default')
deeplearning prediction progress: |██████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
The deep learning model achieves an AUC of 60.66% on the ROC curve and an average precision (PR AUC) of 25.34%.
Generalized Linear Model (GLM) is a versatile statistical framework for analyzing data and building predictive models. It extends traditional linear regression to handle a wider range of data distributions, making it suitable for various types of data, including binary, count, and continuous outcomes. GLM incorporates a link function to connect the linear predictor to the response variable's distribution, allowing for flexible modeling. It's used for tasks like regression, classification, and more, offering interpretability and adaptability. Regularization techniques like Ridge and LASSO can be applied to control model complexity.
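As a minimal illustration of the link-function idea (not H2O's implementation), a binomial GLM maps the linear predictor to a probability through the inverse logit. The coefficients and feature values below are hypothetical, chosen only to show the mechanics:

```python
import numpy as np

def logit_link_inverse(eta):
    """Inverse logit (sigmoid): converts a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical coefficients and one WOE-encoded row (illustration only)
beta = np.array([-1.4, 0.31, 0.21])  # intercept plus two coefficients
x = np.array([1.0, 0.5, -0.2])       # leading 1 pairs with the intercept
p_default = logit_link_inverse(x @ beta)
print(round(float(p_default), 4))
```

The same sigmoid is what turns the `p1` column of H2O's GLM predictions into a value between 0 and 1.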
!pip install h2o
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()
Requirement already satisfied: h2o in /usr/local/lib/python3.10/dist-packages (3.42.0.2) Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from h2o) (2.31.0) Requirement already satisfied: tabulate in /usr/local/lib/python3.10/dist-packages (from h2o) (0.9.0) Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.2.0) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (1.26.16) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->h2o) (2023.7.22) Checking whether there is an H2O instance running at http://localhost:54321. connected.
| H2O_cluster_uptime: | 8 mins 37 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.42.0.2 |
| H2O_cluster_version_age: | 16 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_dsx5kv |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.170 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"} |
| H2O_internal_security: | False |
| Python_version: | 3.10.12 final |
train_df_glm = train_df_rf
test_df_glm = test_df_rf
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_glm.columns.tolist()
predictors=predictors[2:17]
predictors
['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
#Use 50% training data
train_smpl = train_df_glm.sample(frac=0.5, random_state=1)
test_smpl = test_df_glm.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
glm_v1 = H2OGeneralizedLinearEstimator(family= "binomial", lambda_ = 0.05) #, compute_p_values = True)
glm_v1.train(predictors,target,training_frame=train_hex)
glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details ============= H2OGeneralizedLinearEstimator : Generalized Linear Modeling Model Key: GLM_model_python_1691688810087_1
| family | link | regularization | number_of_predictors_total | number_of_active_predictors | number_of_iterations | training_frame | |
|---|---|---|---|---|---|---|---|
| binomial | logit | Elastic Net (alpha = 0.5, lambda = 0.05 ) | 15 | 8 | 4 | Key_Frame__upload_b373c209abd41bd81c47e5a061084af2.hex |
ModelMetricsBinomialGLM: glm ** Reported on train data. ** MSE: 0.1526465109366919 RMSE: 0.3907000267938203 LogLoss: 0.4805152258712424 AUC: 0.6264445896951362 AUCPR: 0.2790494840657378 Gini: 0.25288917939027233 Null degrees of freedom: 31999 Residual degrees of freedom: 31991 Null deviance: 31437.68171069583 Residual deviance: 30752.974455759515 AIC: 30770.974455759515
| 0 | 1 | Error | Rate | |
|---|---|---|---|---|
| 0 | 12789.0 | 13020.0 | 0.5045 | (13020.0/25809.0) |
| 1 | 1976.0 | 4215.0 | 0.3192 | (1976.0/6191.0) |
| Total | 14765.0 | 17235.0 | 0.4686 | (14996.0/32000.0) |
| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.1858362 | 0.3598566 | 266.0 |
| max f2 | 0.1531457 | 0.5526057 | 373.0 |
| max f0point5 | 0.2139063 | 0.3089484 | 165.0 |
| max accuracy | 0.2811279 | 0.8066875 | 5.0 |
| max precision | 0.2844217 | 0.5909091 | 2.0 |
| max recall | 0.1366615 | 1.0 | 399.0 |
| max specificity | 0.2869710 | 0.9998450 | 0.0 |
| max absolute_mcc | 0.2056047 | 0.1504460 | 194.0 |
| max min_per_class_accuracy | 0.1937387 | 0.5894455 | 236.0 |
| max mean_per_class_accuracy | 0.1943942 | 0.5901554 | 233.0 |
| max tns | 0.2869710 | 25805.0 | 0.0 |
| max fns | 0.2869710 | 6187.0 | 0.0 |
| max fps | 0.1366615 | 25809.0 | 399.0 |
| max tps | 0.1366615 | 6191.0 | 399.0 |
| max tnr | 0.2869710 | 0.9998450 | 0.0 |
| max fnr | 0.2869710 | 0.9993539 | 0.0 |
| max fpr | 0.1366615 | 1.0 | 399.0 |
| max tpr | 0.1366615 | 1.0 | 399.0 |
| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | kolmogorov_smirnov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100938 | 0.2647237 | 2.1443292 | 2.1443292 | 0.4148607 | 0.2714893 | 0.4148607 | 0.2714893 | 0.0216443 | 0.0216443 | 114.4329155 | 114.4329155 | 0.0143213 |
| 2 | 0.02 | 0.2589252 | 1.5000915 | 1.8252302 | 0.2902208 | 0.2614576 | 0.353125 | 0.2665205 | 0.0148603 | 0.0365046 | 50.0091463 | 82.5230173 | 0.0204637 |
| 3 | 0.030375 | 0.2561017 | 2.0861997 | 1.9143679 | 0.4036145 | 0.2572176 | 0.3703704 | 0.2633430 | 0.0216443 | 0.0581489 | 108.6199750 | 91.4367930 | 0.0344363 |
| 4 | 0.04 | 0.2521292 | 1.8795612 | 1.9059926 | 0.3636364 | 0.2542846 | 0.36875 | 0.2611633 | 0.0180908 | 0.0762397 | 87.9561240 | 90.5992570 | 0.0449328 |
| 5 | 0.05 | 0.2499286 | 1.7606203 | 1.8769181 | 0.340625 | 0.2510795 | 0.363125 | 0.2591465 | 0.0176062 | 0.0938459 | 76.0620255 | 87.6918107 | 0.0543636 |
| 6 | 0.1003125 | 0.2403076 | 1.5923736 | 1.7342026 | 0.3080745 | 0.2450548 | 0.3355140 | 0.2520787 | 0.0801163 | 0.1739622 | 59.2373622 | 73.4202649 | 0.0913166 |
| 7 | 0.1500313 | 0.2307312 | 1.4911855 | 1.6536694 | 0.2884978 | 0.2352361 | 0.3199333 | 0.2464973 | 0.0741399 | 0.2481021 | 49.1185528 | 65.3669377 | 0.1215958 |
| 8 | 0.2000937 | 0.2215173 | 1.2647734 | 1.5563695 | 0.2446941 | 0.2258361 | 0.3011089 | 0.2413279 | 0.0633177 | 0.3114198 | 26.4773419 | 55.6369467 | 0.1380307 |
| 9 | 0.3022813 | 0.2093273 | 1.2645366 | 1.4577141 | 0.2446483 | 0.2149020 | 0.2820221 | 0.2323945 | 0.1292198 | 0.4406396 | 26.4536614 | 45.7714093 | 0.1715475 |
| 10 | 0.400375 | 0.1986757 | 1.0373813 | 1.3547306 | 0.2007009 | 0.2036881 | 0.2620980 | 0.2253613 | 0.1017606 | 0.5424003 | 3.7381283 | 35.4730586 | 0.1760939 |
| 11 | 0.50225 | 0.1882932 | 1.0210745 | 1.2870527 | 0.1975460 | 0.1932596 | 0.2490045 | 0.2188499 | 0.1040220 | 0.6464222 | 2.1074526 | 28.7052714 | 0.1787559 |
| 12 | 0.6055625 | 0.1815325 | 0.9099328 | 1.2227139 | 0.1760436 | 0.1844916 | 0.2365569 | 0.2129882 | 0.0940074 | 0.7404297 | -9.0067222 | 22.2713850 | 0.1672188 |
| 13 | 0.7000937 | 0.1729501 | 0.7518245 | 1.1591313 | 0.1454545 | 0.1771143 | 0.2242557 | 0.2081442 | 0.0710709 | 0.8115006 | -24.8175504 | 15.9131281 | 0.1381308 |
| 14 | 0.8133125 | 0.1623638 | 0.7504238 | 1.1022364 | 0.1451835 | 0.1665942 | 0.2132483 | 0.2023602 | 0.0849620 | 0.8964626 | -24.9576226 | 10.2236357 | 0.1030960 |
| 15 | 0.9146875 | 0.1561530 | 0.6580492 | 1.0530070 | 0.1273120 | 0.1583600 | 0.2037239 | 0.1974836 | 0.0667097 | 0.9631723 | -34.1950777 | 5.3007007 | 0.0601153 |
| 16 | 1.0 | 0.1366615 | 0.4316794 | 1.0 | 0.0835165 | 0.1504080 | 0.1934687 | 0.1934675 | 0.0368277 | 1.0 | -56.8320550 | 0.0 | 0.0 |
| timestamp | duration | iterations | negative_log_likelihood | objective | training_rmse | training_logloss | training_r2 | training_auc | training_pr_auc | training_lift | training_classification_error | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-08-10 17:43:35 | 0.000 sec | 0 | 15718.8408553 | 0.4912138 | ||||||||
| 2023-08-10 17:43:35 | 0.196 sec | 1 | 15458.4752627 | 0.4882442 | ||||||||
| 2023-08-10 17:43:35 | 0.223 sec | 2 | 15458.9172757 | 0.4882350 | ||||||||
| 2023-08-10 17:43:35 | 0.419 sec | 3 | 15376.9680691 | 0.4879435 | ||||||||
| 2023-08-10 17:43:35 | 0.478 sec | 4 | 15376.4872279 | 0.4879434 | 0.3907000 | 0.4805152 | 0.0217387 | 0.6264446 | 0.2790495 | 2.1443292 | 0.468625 |
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| TD009_bin_WOE | 0.1053973 | 1.0 | 0.3665086 |
| TD005_WOE | 0.0680206 | 0.6453740 | 0.2365351 |
| TD014_bin_WOE | 0.0383866 | 0.3642091 | 0.1334858 |
| AP003_bin_WOE | 0.0334363 | 0.3172407 | 0.1162715 |
| CR015_bin_WOE | 0.0242039 | 0.2296448 | 0.0841668 |
| PA023_bin_WOE | 0.0131986 | 0.1252271 | 0.0458968 |
| PA029_bin_WOE | 0.0046987 | 0.0445805 | 0.0163392 |
| PA022_bin_WOE | 0.0002290 | 0.0021724 | 0.0007962 |
| AP001_WOE | 0.0 | 0.0 | 0.0 |
| AP008_WOE | 0.0 | 0.0 | 0.0 |
| CR009_bin_WOE | 0.0 | 0.0 | 0.0 |
| CR019_WOE | 0.0 | 0.0 | 0.0 |
| TD001_bin_WOE | 0.0 | 0.0 | 0.0 |
| TD006_bin_WOE | 0.0 | 0.0 | 0.0 |
| TD010_bin_WOE | 0.0 | 0.0 | 0.0 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
glm_v1.predict(test_hex)
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| predict | p0 | p1 |
|---|---|---|
| 1 | 0.775178 | 0.224822 |
| 1 | 0.777386 | 0.222614 |
| 1 | 0.781984 | 0.218016 |
| 0 | 0.831052 | 0.168948 |
| 1 | 0.794822 | 0.205178 |
| 1 | 0.787306 | 0.212694 |
| 1 | 0.795895 | 0.204105 |
| 0 | 0.844835 | 0.155165 |
| 0 | 0.834172 | 0.165828 |
| 1 | 0.785259 | 0.214741 |
[8000 rows x 3 columns]
glm_v1.predict(test_hex)['p1']
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| p1 |
|---|
| 0.224822 |
| 0.222614 |
| 0.218016 |
| 0.168948 |
| 0.205178 |
| 0.212694 |
| 0.204105 |
| 0.155165 |
| 0.165828 |
| 0.214741 |
[8000 rows x 1 column]
predictions = glm_v1.predict(test_hex)['p1']
test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| loan_default | p1 | |
|---|---|---|
| 0 | 0 | 0.224822 |
| 1 | 0 | 0.222614 |
| 2 | 0 | 0.218016 |
| 3 | 0 | 0.168948 |
| 4 | 0 | 0.205178 |
import numpy as np  # needed for the cumulative-sum and max calculations below

def createGains(model):
    predictions = model.predict(test_hex)['p1']
    test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
    # sort on prediction (descending), add a row id, and a decile label so each
    # group contains 1/10 of the datapoints
    test_scores = test_scores.sort_values(by='p1', ascending=False)
    test_scores['row_id'] = range(0, len(test_scores))
    test_scores['decile'] = (test_scores['row_id'] / (len(test_scores) / 10)).astype(int)
    # fold any overflow rows (decile 10 from integer truncation) into decile 9;
    # assign only the 'decile' column, not the whole row
    test_scores.loc[test_scores['decile'] == 10, 'decile'] = 9
    # create gains table: row count and actual defaults per decile
    gains = test_scores.groupby('decile')['loan_default'].agg(['count', 'sum'])
    gains.columns = ['count', 'actual']
    # add cumulative features to the gains table
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) / 10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs(gains['percent_cum_actual'] - gains['percent_cum_non_actual']) * 100
    gains['gain'] = (gains['cum_actual'] / gains['cum_count'] * 100).round(2)
    return gains
createGains(glm_v1)
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 800 | 201 | 599 | 800 | 201 | 599 | 0.13 | 0.09 | 151.2 | 1.33 | 4.0 | 25.12 |
| 1 | 800 | 217 | 583 | 1600 | 418 | 1182 | 0.28 | 0.18 | 302.4 | 1.38 | 10.0 | 26.12 |
| 2 | 800 | 195 | 605 | 2400 | 613 | 1787 | 0.41 | 0.28 | 453.6 | 1.35 | 13.0 | 25.54 |
| 3 | 800 | 171 | 629 | 3200 | 784 | 2416 | 0.52 | 0.37 | 604.8 | 1.30 | 15.0 | 24.50 |
| 4 | 800 | 161 | 639 | 4000 | 945 | 3055 | 0.62 | 0.47 | 756.0 | 1.25 | 15.0 | 23.62 |
| 5 | 800 | 127 | 673 | 4800 | 1072 | 3728 | 0.71 | 0.57 | 907.2 | 1.18 | 14.0 | 22.33 |
| 6 | 800 | 125 | 675 | 5600 | 1197 | 4403 | 0.79 | 0.68 | 1058.4 | 1.13 | 11.0 | 21.38 |
| 7 | 800 | 107 | 693 | 6400 | 1304 | 5096 | 0.86 | 0.79 | 1209.6 | 1.08 | 7.0 | 20.38 |
| 8 | 800 | 117 | 683 | 7200 | 1421 | 5779 | 0.94 | 0.89 | 1360.8 | 1.04 | 5.0 | 19.74 |
| 9 | 800 | 91 | 709 | 8000 | 1512 | 6488 | 1.00 | 1.00 | 1512.0 | 1.00 | 0.0 | 18.90 |
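As a sanity check on the gains table above, the top-decile lift can be reproduced by hand from the table's own counts:

```python
# Values copied from the gains table above (8000 test rows, 1512 total defaults)
total_actual = 1512             # cum_actual in the last decile
cum_actual_d0 = 201             # defaults captured in the top decile
if_random = total_actual / 10   # defaults a random ordering would capture per decile
lift_d0 = round(cum_actual_d0 / if_random, 2)
print(lift_d0)  # matches the 1.33 reported for decile 0
```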
def ROC_AUC(my_result, df, target):
    from sklearn.metrics import roc_curve, auc
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt
    # ROC
    y_actual = df[target].as_data_frame()
    y_pred = my_result.predict(df)['p1'].as_data_frame()
    fpr, tpr, _ = roc_curve(y_actual, y_pred)
    roc_auc = auc(fpr, tpr)
    # Precision-Recall
    average_precision = average_precision_score(y_actual, y_pred)
    print('')
    print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
    print('')
    print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print(' * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')
    # plotting
    plt.figure(figsize=(10, 4))
    # ROC
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area=%0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic: AUC={0:0.4f}'.format(roc_auc))
    plt.legend(loc='lower right')
    # Precision-Recall
    plt.subplot(1, 2, 2)
    precision, recall, _ = precision_recall_curve(y_actual, y_pred)
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
    plt.show()
ROC_AUC(glm_v1,test_hex,'loan_default')
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
# Print the Coefficients table
coefs = glm_v1._model_json['output']['coefficients_table'].as_data_frame()
coefs = pd.DataFrame(coefs)
coefs.sort_values(by='standardized_coefficients',ascending=False)
| names | coefficients | standardized_coefficients | |
|---|---|---|---|
| 13 | TD009_bin_WOE | 0.305424 | 0.105397 |
| 11 | TD005_WOE | 0.210866 | 0.068021 |
| 15 | TD014_bin_WOE | 0.136815 | 0.038387 |
| 2 | AP003_bin_WOE | 0.169291 | 0.033436 |
| 5 | CR015_bin_WOE | 0.128310 | 0.024204 |
| 8 | PA023_bin_WOE | 0.069992 | 0.013199 |
| 9 | PA029_bin_WOE | 0.023497 | 0.004699 |
| 7 | PA022_bin_WOE | 0.001180 | 0.000229 |
| 1 | AP001_WOE | 0.000000 | 0.000000 |
| 3 | AP008_WOE | 0.000000 | 0.000000 |
| 4 | CR009_bin_WOE | 0.000000 | 0.000000 |
| 6 | CR019_WOE | 0.000000 | 0.000000 |
| 10 | TD001_bin_WOE | 0.000000 | 0.000000 |
| 12 | TD006_bin_WOE | 0.000000 | 0.000000 |
| 14 | TD010_bin_WOE | 0.000000 | 0.000000 |
| 0 | Intercept | -1.413201 | -1.439362 |
To get the best possible model, GLM needs to find the optimal values of the regularization parameters 𝛼 and 𝜆. When performing regularization, penalties are introduced to the model building process to avoid overfitting, reduce the variance of the prediction error, and handle correlated predictors.
Lambda (λ) is a regularization parameter that controls the overall strength of regularization in models like Ridge regression, LASSO, and Elastic Net; when λ = 0, no regularization is applied and overfitting becomes a risk. Alpha (α) sets the balance between the LASSO and Ridge penalties in Elastic Net: α = 0 gives Ridge, α = 1 gives LASSO, and 0 < α < 1 blends both. Ridge (λ > 0, α = 0) shrinks coefficients with an L2 penalty, LASSO (λ > 0, α = 1) enforces sparsity via an L1 penalty, and Elastic Net (λ > 0, 0 < α < 1) combines both penalties, offering flexibility in feature selection and coefficient shrinkage.
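The penalty these two knobs control can be sketched directly. This is a minimal illustration of the standard elastic-net form λ(α‖β‖₁ + (1−α)/2 ‖β‖₂²), computed on a toy coefficient vector; H2O applies it inside its own solver rather than exposing a function like this:

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """lambda * (alpha * L1 + (1 - alpha)/2 * L2) on a coefficient vector."""
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)

beta = np.array([0.3, -0.2, 0.0])
print(elastic_net_penalty(beta, lam=0.05, alpha=0.5))  # Elastic Net: blends L1 and L2
print(elastic_net_penalty(beta, lam=0.05, alpha=1.0))  # pure LASSO (L1 only)
print(elastic_net_penalty(beta, lam=0.05, alpha=0.0))  # pure Ridge (L2 only)
```

Note that `glm_v1` above used α = 0.5 and λ = 0.05, which is why several coefficients were shrunk exactly to zero.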
We'll perform grid search to find the best value for the regularization parameter lambda (λ) in a GLM.
train, valid= train_hex.split_frame(ratios = [.8])
# Example of values to grid over for `lambda`
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch
# select the values for lambda_ to grid over
hyper_params = {'lambda': [1, 0.5, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0]}
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: {'strategy': "RandomDiscrete"}
# initialize the glm estimator
glm_v2 = H2OGeneralizedLinearEstimator(family = 'binomial')
# build grid search with previously made GLM and hyperparameters
grid = H2OGridSearch(model = glm_v2, hyper_params = hyper_params,
search_criteria = {'strategy': "Cartesian"})
# train using the grid
grid.train(x = predictors, y = target, training_frame = train, validation_frame = valid)
glm Grid Build progress: |███████████████████████████████████████████████████████| (done) 100%
| lambda | model_ids | logloss | |
|---|---|---|---|
| 0.001 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_5 | 0.4729397 | |
| 0.0001 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_6 | 0.4731181 | |
| 1e-05 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_7 | 0.4731423 | |
| 0.0 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_8 | 0.4731491 | |
| 0.01 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_4 | 0.4732497 | |
| 0.1 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_3 | 0.4911780 | |
| 1.0 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_1 | 0.4929956 | |
| 0.5 | Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_2 | 0.4929956 |
# sort the grid models by decreasing AUC
sorted_grid = grid.get_grid(sort_by = 'auc', decreasing = True)
print(sorted_grid)
Hyper-Parameter Search Summary: ordered by decreasing auc
lambda model_ids auc
-- -------- ----------------------------------------------------------- --------
0.001 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_5 0.641245
0.01 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_4 0.641218
0.0001 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_6 0.640806
1e-05 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_7 0.64072
0 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_8 0.640669
0.1 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_3 0.594032
1 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_1 0.5
0.5 Grid_GLM_py_9_sid_ab82_model_python_1691688810087_3_model_2 0.5
After conducting a grid search over different lambda values to find the best regularization strength for the GLM, the results are sorted by AUC to compare predictive accuracy. Using 50% of the training data, lambda values of 0.001, 0.01, and 0.0001 produced the highest AUCs. Smaller lambda values generally performed better, while heavier regularization (0.1, 0.5, and 1) produced weaker models; at λ = 1 and λ = 0.5 the AUC falls to 0.5, meaning all coefficients are shrunk to zero and the model is no better than random.
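The same selection can be sketched in plain Python from the leaderboard values above (the list literal simply copies the grid output; this is an illustration, not an H2O call):

```python
# (lambda, validation AUC) pairs copied from the sorted grid output above
results = [
    (0.001, 0.641245), (0.01, 0.641218), (0.0001, 0.640806),
    (1e-05, 0.64072), (0.0, 0.640669), (0.1, 0.594032),
    (1.0, 0.5), (0.5, 0.5),
]
best_lambda, best_auc = max(results, key=lambda pair: pair[1])
print(best_lambda)  # the value used to refit glm_v2 below
```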
glm_v2 = H2OGeneralizedLinearEstimator(family= "binomial", lambda_ = 0.001) #, compute_p_values = True)
glm_v2.train(predictors,target,training_frame=train_hex)
glm_v2.predict(test_hex)
glm_v2.predict(test_hex)['p1']
glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100% glm prediction progress: |███████████████████████████████████████████████████████| (done) 100% glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| p1 |
|---|
| 0.357331 |
| 0.294859 |
| 0.253227 |
| 0.16267 |
| 0.221681 |
| 0.247293 |
| 0.177558 |
| 0.0841446 |
| 0.108281 |
| 0.277609 |
[8000 rows x 1 column]
predictions = glm_v2.predict(test_hex)['p1']
test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
test_scores.head()
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| loan_default | p1 | |
|---|---|---|
| 0 | 0 | 0.357331 |
| 1 | 0 | 0.294859 |
| 2 | 0 | 0.253227 |
| 3 | 0 | 0.162670 |
| 4 | 0 | 0.221681 |
ROC_AUC(glm_v2,test_hex,'loan_default')
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100% * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
AutoML, short for Automated Machine Learning, is a powerful tool that automates the process of building and optimizing machine learning models. It streamlines and accelerates the complex steps involved in model selection, feature engineering, hyperparameter tuning, and ensemble building. AutoML algorithms search through a variety of model architectures, preprocessing techniques, and hyperparameter configurations to find the best-performing model for a given task. It reduces the need for manual trial-and-error, making machine learning accessible to a wider range of users, including those without extensive data science expertise. AutoML helps in faster model development, improved model accuracy, and increased efficiency in deploying machine learning solutions across different domains and applications.
import h2o
from h2o.automl import H2OAutoML
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
| H2O_cluster_uptime: | 1 hour 46 mins |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.42.0.2 |
| H2O_cluster_version_age: | 16 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_dsx5kv |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 3.166 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null, "colab_language_server": "/usr/colab/bin/language_service"} |
| H2O_internal_security: | False |
| Python_version: | 3.10.12 final |
train_df_auto = train_df_rf
test_df_auto = test_df_rf
#Use all the features first for testing
target = 'loan_default'
predictors = train_df_auto.columns.tolist()
predictors=predictors[2:17]
predictors
['AP001_WOE', 'AP003_bin_WOE', 'AP008_WOE', 'CR009_bin_WOE', 'CR015_bin_WOE', 'CR019_WOE', 'PA022_bin_WOE', 'PA023_bin_WOE', 'PA029_bin_WOE', 'TD001_bin_WOE', 'TD005_WOE', 'TD006_bin_WOE', 'TD009_bin_WOE', 'TD010_bin_WOE', 'TD014_bin_WOE']
#Use 50% training data
train_smpl = train_df_auto.sample(frac=0.5, random_state=1)
test_smpl = test_df_auto.sample(frac=0.5, random_state=1)
train_hex = h2o.H2OFrame(train_smpl)
test_hex = h2o.H2OFrame(test_smpl)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100% Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Run AutoML, stopping after 60 seconds. The max_runtime_secs argument limits the AutoML run by time. With a time-limited stopping criterion, the number of models trained will vary between runs; if different hardware is used, or if the same machine has different compute resources available between runs, AutoML may be able to train more models on one run than on another.
A test frame can be passed explicitly to the leaderboard_frame argument, in which case test-set metrics rather than cross-validated metrics are used to generate the leaderboard; since no leaderboard_frame is supplied here, the leaderboard is ranked on cross-validated metrics.
# Run AutoML for up to 20 base models, capped at 60 seconds of runtime
aml_v1 = H2OAutoML(max_runtime_secs = 60, max_models=20, seed=1)
aml_v1.train(predictors,target,training_frame=train_hex)
AutoML progress: |████████████████████████████████████████████████████████████████| (done) 100%
19:21:17.952: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training. (The same warning was emitted for each subsequent model.)
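The warnings above indicate that AutoML treated loan_default as a numeric response and therefore trained regression models. A minimal sketch of the fix, using a tiny hypothetical frame in place of train_smpl: cast the 0/1 target to a categorical dtype before building the H2OFrame, so that AutoML trains binary classifiers instead.

```python
import pandas as pd

# Hypothetical mini-frame standing in for train_smpl in the notebook.
train_smpl = pd.DataFrame({
    "loan_default": [0, 1, 0, 1],
    "AP001_WOE": [0.01, 0.10, -0.04, -0.03],
})

# Casting the 0/1 target to a categorical dtype before h2o.H2OFrame(train_smpl)
# makes H2O parse the column as an enum, so AutoML trains binary classifiers
# (reporting AUC and lift) rather than regression models.
train_smpl["loan_default"] = train_smpl["loan_default"].astype(str).astype("category")
print(train_smpl["loan_default"].dtype)  # → category
```

Equivalently, the conversion can be done on the H2OFrame itself with `train_hex[target] = train_hex[target].asfactor()`.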
Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_1_AutoML_1_20230810_192117
| family | link | regularization | lambda_search | number_of_predictors_total | number_of_active_predictors | number_of_iterations | training_frame | |
|---|---|---|---|---|---|---|---|---|
| gaussian | identity | Ridge ( lambda = 0.0117 ) | nlambda = 30, lambda.max = 5.7259, lambda.min = 0.0117, lambda.1se = -1.0 | 15 | 15 | 14 | AutoML_1_20230810_192117_training_Key_Frame__upload_932d3d6ace027d22c9061cf756964de2.hex |
ModelMetricsRegressionGLM: glm ** Reported on train data. **
MSE: 0.14996128630919411
RMSE: 0.3872483522356088
MAE: 0.3002231894050112
RMSLE: 0.2717943242231826
Mean Residual Deviance: 0.14996128630919411
R^2: 0.04083176836665714
Null degrees of freedom: 22384
Residual degrees of freedom: 22369
Null deviance: 3499.785838731167
Residual deviance: 3356.88339403131
AIC: 21087.069151105054
ModelMetricsRegressionGLM: glm ** Reported on validation data. **
MSE: 0.1475144003497862
RMSE: 0.38407603459443573
MAE: 0.2980889742140967
RMSLE: 0.27059565475447794
Mean Residual Deviance: 0.1475144003497862
R^2: 0.03373079160541492
Null degrees of freedom: 3169
Residual degrees of freedom: 3154
Null deviance: 484.0569529248307
Residual deviance: 467.6206491088222
AIC: 2963.2308537435215
| timestamp | duration | iteration | lambda | predictors | deviance_train | deviance_test | alpha | iterations | training_rmse | training_deviance | training_mae | training_r2 | validation_rmse | validation_deviance | validation_mae | validation_r2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-08-10 19:21:30 | 0.000 sec | 1 | .57E1 | 16 | 0.1529860 | 0.1498662 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.062 sec | 2 | .36E1 | 16 | 0.1522020 | 0.1492368 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.130 sec | 3 | .22E1 | 16 | 0.1515042 | 0.1486899 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.204 sec | 4 | .14E1 | 16 | 0.1509490 | 0.1482623 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.243 sec | 5 | .85E0 | 16 | 0.1505483 | 0.1479579 | 0.0 | 5 | 0.3872484 | 0.1499613 | 0.3002232 | 0.0408318 | 0.3840760 | 0.1475144 | 0.2980890 | 0.0337308 | |
| 2023-08-10 19:21:30 | 0.285 sec | 6 | .53E0 | 16 | 0.1502847 | 0.1477589 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.319 sec | 7 | .33E0 | 16 | 0.1501273 | 0.1476415 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.328 sec | 8 | .2E0 | 16 | 0.1500414 | 0.1475783 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.364 sec | 9 | .13E0 | 16 | 0.1499981 | 0.1475466 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.375 sec | 10 | .79E-1 | 16 | 0.1499774 | 0.1475310 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.384 sec | 11 | .49E-1 | 16 | 0.1499680 | 0.1475228 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.392 sec | 12 | .3E-1 | 16 | 0.1499638 | 0.1475182 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.399 sec | 13 | .19E-1 | 16 | 0.1499620 | 0.1475157 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.407 sec | 14 | .12E-1 | 16 | 0.1499613 | 0.1475144 | 0.0 | None | |||||||||
| 2023-08-10 19:21:30 | 0.414 sec | 15 | .73E-2 | 16 | 0.1499609 | 0.1475129 | 0.0 | None |
| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| AP003_bin_WOE | 0.0292007 | 1.0 | 0.1675982 |
| TD009_bin_WOE | 0.0257420 | 0.8815538 | 0.1477468 |
| CR015_bin_WOE | 0.0255326 | 0.8743841 | 0.1465452 |
| TD014_bin_WOE | 0.0173098 | 0.5927868 | 0.0993500 |
| TD005_WOE | 0.0148332 | 0.5079733 | 0.0851354 |
| PA023_bin_WOE | 0.0122045 | 0.4179545 | 0.0700484 |
| AP008_WOE | 0.0119189 | 0.4081737 | 0.0684092 |
| TD001_bin_WOE | 0.0106739 | 0.3655366 | 0.0612633 |
| PA029_bin_WOE | 0.0104481 | 0.3578045 | 0.0599674 |
| PA022_bin_WOE | 0.0048242 | 0.1652069 | 0.0276884 |
| CR019_WOE | 0.0043454 | 0.1488130 | 0.0249408 |
| TD010_bin_WOE | 0.0034180 | 0.1170529 | 0.0196178 |
| AP001_WOE | 0.0020142 | 0.0689766 | 0.0115603 |
| CR009_bin_WOE | 0.0010764 | 0.0368631 | 0.0061782 |
| TD006_bin_WOE | 0.0006883 | 0.0235725 | 0.0039507 |
[tips] Use `model.explain()` to inspect the model. -- Use `h2o.display.toggle_user_tips()` to switch on/off this section.
Next, we will view the AutoML Leaderboard. If a leaderboard_frame is specified in the H2OAutoML.train() method, the models are scored and ranked on that frame; since none was supplied here, the leaderboard ranks models by their cross-validated metrics.
A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric. In the case of regression, the default ranking metric is mean residual deviance. In the future, the user will be able to specify any of the H2O metrics so that different metrics can be used to generate rankings on the leaderboard.
Now we will view a snapshot of the top models. Here we should see the GLM at the top of the leaderboard.
aml_v1.leaderboard.head()
| model_id | rmse | mse | mae | rmsle | mean_residual_deviance |
|---|---|---|---|---|---|
| GLM_1_AutoML_1_20230810_192117 | 0.384076 | 0.147514 | 0.298089 | 0.270596 | 0.147514 |
| GBM_2_AutoML_1_20230810_192117 | 0.385505 | 0.148614 | 0.297588 | 0.271512 | 0.148614 |
| GBM_1_AutoML_1_20230810_192117 | 0.386179 | 0.149134 | 0.298923 | 0.272413 | 0.149134 |
| GBM_3_AutoML_1_20230810_192117 | 0.387704 | 0.150314 | 0.299725 | 0.273626 | 0.150314 |
| GBM_4_AutoML_1_20230810_192117 | 0.388605 | 0.151014 | 0.297659 | 0.274501 | 0.151014 |
| XGBoost_3_AutoML_1_20230810_192117 | 0.389388 | 0.151623 | 0.298836 | 0.274869 | 0.151623 |
| DRF_1_AutoML_1_20230810_192117 | 0.399637 | 0.15971 | 0.308563 | 0.285488 | 0.15971 |
| XRT_1_AutoML_1_20230810_192117 | 0.402102 | 0.161686 | 0.30966 | 0.288018 | 0.161686 |
| XGBoost_2_AutoML_1_20230810_192117 | 0.418907 | 0.175483 | 0.312766 | 0.301935 | 0.175483 |
| XGBoost_1_AutoML_1_20230810_192117 | 0.428606 | 0.183703 | 0.318993 | 0.311319 | 0.183703 |
[10 rows x 6 columns]
The leaderboard compares the candidate models on several regression metrics: Root Mean Squared Error (RMSE), Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Logarithmic Error (RMSLE), and Mean Residual Deviance. Lower values indicate better performance. Here the top-performing model is "GLM_1", with the lowest value on every metric. The "GBM" and "XGBoost" models exhibit slightly higher errors, the "DRF" and "XRT" models larger ones, and "XGBoost_1" ranks last with the highest error metrics. The ranking provides a consistent basis for selecting the best-performing model under these criteria.
If you need to generate predictions on a test set, you can make predictions on the "H2OAutoML" object directly, or on the leader model object.
pred = aml_v1.predict(test_hex)
pred.head()
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| predict |
|---|
| 0.338041 |
| 0.29501 |
| 0.252813 |
| 0.172356 |
| 0.228082 |
| 0.253299 |
| 0.193611 |
| 0.0664591 |
| 0.101311 |
| 0.279401 |
[10 rows x 1 column]
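Because the leader here is a regression model, predict() returns continuous scores rather than class labels; turning them into default/non-default decisions requires choosing a cutoff. A sketch with a purely illustrative threshold of 0.25 (an assumption, not a value chosen in the notebook):

```python
import pandas as pd

# First five scores copied from the prediction output above; the 0.25 cutoff
# is purely illustrative.
scores = pd.Series([0.338041, 0.29501, 0.252813, 0.172356, 0.228082])
labels = (scores >= 0.25).astype(int)
print(labels.tolist())  # → [1, 1, 1, 0, 0]
```

In practice the cutoff would be tuned against business costs or the gains table below, not fixed at an arbitrary value.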
The model_performance() method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.
perf = aml_v1.leader.model_performance(test_hex)
perf
ModelMetricsRegressionGLM: glm ** Reported on test data. **
MSE: 0.15068694895577303
RMSE: 0.3881841688628904
MAE: 0.3000688362044224
RMSLE: 0.27298560774821146
Mean Residual Deviance: 0.15068694895577303
R^2: 0.016910672983428743
Null degrees of freedom: 7999
Residual degrees of freedom: 7984
Null deviance: 1226.4295416639784
Residual deviance: 1205.4955916461843
AIC: 7596.610291955
# Get the best model from AutoML
best_model = aml_v1.leader
best_model
(The leader model is GLM_1_AutoML_1_20230810_192117; its model details, scoring history, and variable importances are identical to the output shown above.)
def createGains(model):
    # Score the test set and join predictions to the actual target
    predictions = model.predict(test_hex)
    test_scores = test_hex['loan_default'].cbind(predictions).as_data_frame()
    # Sort on prediction (descending), then add a row id and a decile
    # so that each group contains ~1/10 of the data points
    test_scores = test_scores.sort_values(by='predict', ascending=False)
    test_scores['row_id'] = range(len(test_scores))
    test_scores['decile'] = (test_scores['row_id'] / (len(test_scores) / 10)).astype(int)
    # Fold any overflow rows into the last decile (assign to the 'decile'
    # column only, not the whole row)
    test_scores.loc[test_scores['decile'] == 10, 'decile'] = 9
    # Create the gains table: record count and number of defaults per decile
    gains = test_scores.groupby('decile')['loan_default'].agg(['count', 'sum'])
    gains.columns = ['count', 'actual']
    # Add cumulative features to the gains table
    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    # Expected cumulative defaults under random targeting
    gains['if_random'] = np.max(gains['cum_actual']) / 10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs(gains['percent_cum_actual'] - gains['percent_cum_non_actual']) * 100
    gains['gain'] = (gains['cum_actual'] / gains['cum_count'] * 100).round(2)
    return gains
createGains(best_model)
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | ||||||||||||
| 0 | 800 | 230 | 570 | 800 | 230 | 570 | 0.15 | 0.09 | 151.2 | 1.52 | 6.0 | 28.75 |
| 1 | 800 | 197 | 603 | 1600 | 427 | 1173 | 0.28 | 0.18 | 302.4 | 1.41 | 10.0 | 26.69 |
| 2 | 800 | 198 | 602 | 2400 | 625 | 1775 | 0.41 | 0.27 | 453.6 | 1.38 | 14.0 | 26.04 |
| 3 | 800 | 189 | 611 | 3200 | 814 | 2386 | 0.54 | 0.37 | 604.8 | 1.35 | 17.0 | 25.44 |
| 4 | 800 | 147 | 653 | 4000 | 961 | 3039 | 0.64 | 0.47 | 756.0 | 1.27 | 17.0 | 24.02 |
| 5 | 800 | 112 | 688 | 4800 | 1073 | 3727 | 0.71 | 0.57 | 907.2 | 1.18 | 14.0 | 22.35 |
| 6 | 800 | 109 | 691 | 5600 | 1182 | 4418 | 0.78 | 0.68 | 1058.4 | 1.12 | 10.0 | 21.11 |
| 7 | 800 | 124 | 676 | 6400 | 1306 | 5094 | 0.86 | 0.79 | 1209.6 | 1.08 | 7.0 | 20.41 |
| 8 | 800 | 107 | 693 | 7200 | 1413 | 5787 | 0.93 | 0.89 | 1360.8 | 1.04 | 4.0 | 19.62 |
| 9 | 800 | 99 | 701 | 8000 | 1512 | 6488 | 1.00 | 1.00 | 1512.0 | 1.00 | 0.0 | 18.90 |
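The lift arithmetic in the table can be sanity-checked on a toy example (hypothetical counts, not from this dataset): with equal-sized deciles, the random baseline for the first k deciles is total defaults × k/N, and cumulative lift is cumulative actuals divided by that baseline.

```python
import pandas as pd

# Hypothetical two-decile gains table: 10 records, 4 defaults in total.
gains = pd.DataFrame({"count": [5, 5], "actual": [3, 1]})
gains["cum_count"] = gains["count"].cumsum()
gains["cum_actual"] = gains["actual"].cumsum()
# Expected cumulative defaults if targeting were random
total_actual = gains["actual"].sum()
gains["if_random"] = total_actual * gains["cum_count"] / gains["cum_count"].max()
gains["lift"] = gains["cum_actual"] / gains["if_random"]
print(gains["lift"].tolist())  # → [1.5, 1.0]
```

By construction, lift in the final decile is always 1.0, matching the last row of the table above.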
def ROC_AUC(my_result, df, target):
    from sklearn.metrics import roc_curve, auc
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt
    # ROC
    y_actual = df[target].as_data_frame()
    y_pred = my_result.predict(df).as_data_frame()
    fpr, tpr, _ = roc_curve(y_actual, y_pred)
    roc_auc = auc(fpr, tpr)
    # Precision-Recall
    average_precision = average_precision_score(y_actual, y_pred)
    print('')
    print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
    print('')
    print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print(' * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')
    # Plotting
    plt.figure(figsize=(10, 4))
    # ROC
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area=%0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic: AUC={0:0.4f}'.format(roc_auc))
    plt.legend(loc='lower right')
    # Precision-Recall
    plt.subplot(1, 2, 2)
    precision, recall, _ = precision_recall_curve(y_actual, y_pred)
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
    plt.show()
ROC_AUC(best_model,test_hex,'loan_default')
glm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
 * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate
 * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy
 * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
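As a quick sanity check of the metric itself, sklearn's roc_curve/auc on a toy set of labels and scores (hypothetical values, not from this dataset) where every positive outscores every negative yields an AUC of 1.0:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical labels and scores with a perfect ranking
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.8, 0.9])
fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # → 1.0
```

A model that ranked the labels at random would instead hover around 0.5, the diagonal baseline in the ROC plot.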
SHAP (SHapley Additive exPlanations) is a technique in explainable AI that quantifies the contribution of each feature to a model's predictions. It calculates values representing how much each feature influences predictions, considering interactions. SHAP values enable clear feature importance ranking and help interpret complex models. They enhance model transparency, aiding users in understanding decision-making processes.
#Concatenate along rows (vertically)
data_shap = pd.concat([train_df_rf, test_df_rf])
data_shap = data_shap.sort_values(by='id', ascending=True)
data_shap
| id | loan_default | AP001_WOE | AP003_bin_WOE | AP008_WOE | CR009_bin_WOE | CR015_bin_WOE | CR019_WOE | PA022_bin_WOE | PA023_bin_WOE | PA029_bin_WOE | TD001_bin_WOE | TD005_WOE | TD006_bin_WOE | TD009_bin_WOE | TD010_bin_WOE | TD014_bin_WOE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15109 | 1 | 1 | 0.01 | 0.07 | 0.02 | 0.07 | 0.19 | 0.14 | -0.15 | -0.13 | -0.14 | -0.24 | 0.04 | -0.14 | 0.04 | -0.24 | -0.08 |
| 24229 | 2 | 0 | 0.10 | 0.07 | 0.09 | 0.08 | -0.27 | -0.20 | -0.15 | -0.13 | -0.14 | 0.02 | -0.03 | -0.14 | -0.18 | -0.24 | -0.08 |
| 56026 | 3 | 0 | -0.04 | -0.50 | -0.09 | 0.07 | 0.19 | 0.12 | -0.15 | -0.13 | -0.14 | 0.02 | 0.04 | -0.14 | 0.04 | -0.24 | -0.30 |
| 22834 | 4 | 0 | -0.03 | -0.50 | 0.11 | -0.09 | 0.08 | -0.05 | -0.15 | -0.13 | -0.14 | -0.24 | -0.44 | -0.14 | -0.49 | -0.24 | -0.30 |
| 2642 | 5 | 0 | -0.04 | 0.07 | 0.09 | -0.09 | -0.27 | -0.20 | -0.15 | -0.13 | -0.14 | 0.02 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 51386 | 79996 | 0 | -0.10 | 0.07 | 0.02 | 0.07 | 0.08 | -0.09 | -0.15 | -0.13 | -0.14 | 0.02 | -0.22 | -0.14 | -0.18 | -0.24 | 0.14 |
| 17903 | 79997 | 0 | 0.01 | -0.50 | 0.09 | 0.08 | 0.08 | 0.02 | -0.15 | -0.13 | -0.14 | -0.24 | -0.22 | -0.14 | -0.49 | -0.24 | -0.30 |
| 16471 | 79998 | 0 | -0.14 | 0.07 | 0.02 | -0.09 | 0.19 | -0.09 | -0.15 | -0.13 | -0.14 | -0.24 | -0.51 | 0.11 | -0.49 | 0.00 | -0.08 |
| 36131 | 79999 | 0 | -0.05 | 0.07 | -0.09 | 0.07 | 0.19 | 0.02 | -0.15 | -0.13 | -0.14 | -0.24 | -0.44 | -0.14 | -0.49 | -0.24 | -0.30 |
| 42494 | 80000 | 1 | 0.04 | 0.07 | 0.02 | 0.07 | 0.08 | 0.02 | 0.22 | 0.26 | 0.07 | 0.39 | 0.41 | 0.40 | 0.17 | 0.45 | 0.48 |
80000 rows × 17 columns
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
predictors = data_shap.columns.tolist()
predictors=predictors[2:17]
predictors
Y = data_shap['loan_default']
X = data_shap[predictors]
#Train-test split on the features (X) and target (Y) data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
#max_depth=6: This sets the maximum depth of each tree in the forest to 6. It limits how deep each individual tree can grow, helping to control overfitting.
#n_estimators=10: This specifies the number of decision trees (estimators) to create in the random forest ensemble.
model = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
model.fit(X_train, Y_train)
print(model.feature_importances_)
#which features (input variables) were most influential in making predictions
[0.03332825 0.15531092 0.04499582 0.01275476 0.02906626 0.03585508 0.02209713 0.03377747 0.07034303 0.01827997 0.11864086 0.00602769 0.31841822 0.02206263 0.07904192]
importances = model.feature_importances_
indices = np.argsort(importances)
features = X_train.columns
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
predictors = data_shap.columns.tolist()
predictors= ['AP001_WOE',
'AP003_bin_WOE',
'AP008_WOE',
'CR015_bin_WOE',
'PA022_bin_WOE',
'PA023_bin_WOE',
'PA029_bin_WOE',
'TD005_WOE',
'TD009_bin_WOE',
'TD014_bin_WOE']
Y = data_shap['loan_default']
X = data_shap[predictors]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
#ARCHFLAGS="-arch x86_64"
#!pip3 install shap
!pip install git+https://github.com/slundberg/shap.git
import shap
#'check_additivity=False' disables the additivity check for faster computation
shap_values = shap.TreeExplainer(model).shap_values(X_train, check_additivity=False)
# Determine the correlation in order to plot with different colors
corrlist = np.zeros(len(predictors))
X_train_np = X_train.to_numpy() # our X_train is a pandas data frame. Convert it to numpy
for i in range(0,len(predictors) ):
tmp = np.corrcoef(shap_values[:,i],X_train_np[:,i])
corrlist[i] = tmp[0][1]
Collecting git+https://github.com/slundberg/shap.git
Resolved https://github.com/slundberg/shap.git to commit ec17a2604127c16b83caaf8e3b4d10eeadaa73ee
Requirement already satisfied: numpy, scipy, scikit-learn, pandas, tqdm, packaging, slicer, numba, cloudpickle (and their dependencies) for shap==0.42.1
corrlist
# The correlation coefficient measures the strength and direction of the linear relationship between two variables.
# In this context, it helps understand how the SHAP values are related to the actual feature values. After this loop completes, corrlist will contain the correlation coefficients for each feature, indicating how much the SHAP values and the actual feature values align.
array([0.23669973, 0.68691416, 0.78960678, 0.05166419, 0.94047705,
0.816006 , 0.96310438, 0.72578309, 0.80402505, 0.69202017])
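The sign convention behind these coefficients can be checked on toy vectors (hypothetical values): SHAP values that rise with the feature value produce a positive correlation, which the ABS_SHAP function below plots in red.

```python
import numpy as np

# Hypothetical SHAP column that increases with its feature column
shap_col = np.array([-0.2, -0.1, 0.1, 0.3])
feat_col = np.array([0.0, 1.0, 2.0, 3.0])
# Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient
r = np.corrcoef(shap_col, feat_col)[0, 1]
print(r > 0)  # → True
```

A negative coefficient would indicate the opposite pattern: higher feature values pushing the prediction down.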
# Calculate the absolute SHAP values
shap_v_abs = np.abs(shap_values)
shap_v_abs_mean = shap_v_abs.mean(axis=0)
shap_v_abs_mean
array([0.00128068, 0.03765104, 0.00477671, 0.00717649, 0.00687697,
0.00435646, 0.00428661, 0.00684885, 0.01171429, 0.00281381])
k = pd.DataFrame({'Variables': predictors, 'abs_SHAP': shap_v_abs_mean}).reset_index()
k
| index | Variables | abs_SHAP | |
|---|---|---|---|
| 0 | 0 | AP001_WOE | 0.001281 |
| 1 | 1 | AP003_bin_WOE | 0.037651 |
| 2 | 2 | AP008_WOE | 0.004777 |
| 3 | 3 | CR015_bin_WOE | 0.007176 |
| 4 | 4 | PA022_bin_WOE | 0.006877 |
| 5 | 5 | PA023_bin_WOE | 0.004356 |
| 6 | 6 | PA029_bin_WOE | 0.004287 |
| 7 | 7 | TD005_WOE | 0.006849 |
| 8 | 8 | TD009_bin_WOE | 0.011714 |
| 9 | 9 | TD014_bin_WOE | 0.002814 |
shap.summary_plot(shap_values, X_train, plot_type="bar")
Can the above variable importance plot show the direction of the relationship between each feature and the target variable? Not by itself, but the SHAP summary plot below can; that is the power of SHAP values. The summary plot is made of many dots, each with three characteristics: it plots the SHAP value of every feature for every sample, sorts features by the sum of absolute SHAP values over all samples, and colors each dot by the feature value (red high, blue low).
def ABS_SHAP(df_shap,df):
#import matplotlib as plt
# Make a copy of the input data
shap_v = pd.DataFrame(df_shap)
feature_list = df.columns
shap_v.columns = feature_list
df_v = df.copy().reset_index().drop('index',axis=1)
# Determine the correlation in order to plot with different colors
corr_list = list()
for i in feature_list:
b = np.corrcoef(shap_v[i],df_v[i])[1][0]
corr_list.append(b)
corr_df = pd.concat([pd.Series(feature_list),pd.Series(corr_list)],axis=1).fillna(0)
# Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
corr_df.columns = ['Variable','Corr']
corr_df['Sign'] = np.where(corr_df['Corr']>0,'red','blue')
# Plot it
shap_abs = np.abs(shap_v)
k=pd.DataFrame(shap_abs.mean()).reset_index()
k.columns = ['Variable','SHAP_abs']
k2 = k.merge(corr_df,left_on = 'Variable',right_on='Variable',how='inner')
k2 = k2.sort_values(by='SHAP_abs',ascending = True)
colorlist = k2['Sign']
ax = k2.plot.barh(x='Variable',y='SHAP_abs',color = colorlist, figsize=(6,4),legend=False)
ax.set_xlabel("SHAP Value (Red = Positive Impact)")
ABS_SHAP(shap_values,X_train)
shap.summary_plot(shap_values, X_train)
#generates a summary plot using the SHAP values and the training data
To understand how a single feature affects the output of the model, we can plot the SHAP value of that feature vs. the value of the feature for all the examples in a dataset. Vertical dispersion at a single value represents interaction effects with other features. To help reveal these interactions, dependence_plot automatically selects another feature for coloring.
shap.dependence_plot("TD009_bin_WOE", shap_values, X_train)
# In this case, the coloring highlights that AP003_bin_WOE has more impact on loan default than TD009_bin_WOE.
shap_interaction_values = shap.TreeExplainer(model).shap_interaction_values(X_train.iloc[:2000,:])
shap.summary_plot(shap_interaction_values, X_train.iloc[:2000,:])
shap.initjs()
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train, check_additivity=False)
shap.force_plot(explainer.expected_value, shap_values[0])
AP003_bin_WOE (AP003 - CODE_EDUCATION):
TD009_bin_WOE (TD009 - TD_CNT_QUERY_LAST_3MON_P2P):
PA022_bin_WOE (PA022 - DAYS_BTW_APPLICATION_AND_FIRST_COLLECTION_OR_HIGH_RISK_CALL):
CR015_bin_WOE (CR015 - MONTH_CREDIT_CARD_MOB_MAX):
TD005_WOE (TD005 - TD_CNT_QUERY_LAST_1MON_P2P):